- Internet Engineering Task Force Audio-Video Transport Working Group
- INTERNET-DRAFT H. Schulzrinne
- draft-ietf-avt-issues-01.txt AT&T Bell Laboratories
- October 20, 1993
- Expires: 03/01/94
-
- Issues in Designing a Transport Protocol for Audio and Video Conferences and
- other Multiparticipant Real-Time Applications
-
-
- Status of this Memo
-
-
- This document is an Internet Draft. Internet Drafts are working documents
- of the Internet Engineering Task Force (IETF), its Areas, and its Working
- Groups. Note that other groups may also distribute working documents as
- Internet Drafts.
-
- Internet Drafts are draft documents valid for a maximum of six months.
- Internet Drafts may be updated, replaced, or obsoleted by other documents
- at any time. It is not appropriate to use Internet Drafts as reference
- material or to cite them other than as a ``working draft'' or ``work in
- progress.''
-
- Please check the I-D abstract listing contained in each Internet Draft
- directory to learn the current status of this or any other Internet Draft.
-
- Distribution of this document is unlimited.
-
-
- Abstract
-
- This memorandum is a companion document to the current version
- of the RTP protocol specification draft-ietf-avt-rtp-*.{txt,ps}.
- It discusses aspects of transporting real-time services (such as
- voice or video) over the Internet. It compares and evaluates
- design alternatives for a real-time transport protocol, providing
- rationales for the design decisions made for RTP. Also covered are
- issues of port assignment and multicast address allocation. A
- comprehensive glossary of terms related to multimedia conferencing
- is provided.
-
-
- This document is a product of the Audio-Video Transport working group within
- the Internet Engineering Task Force. Comments are solicited and should be
- addressed to the working group's mailing list at rem-conf@es.net and/or the
- INTERNET-DRAFT draft-ietf-avt-issues-01.txt October 20, 1993
-
- author(s).
-
-
- Contents
-
-
- 1 Introduction 4
-
- 2 Goals 7
-
- 3 Services 9
-
- 3.1 Duplex or Simplex? . . . . . . . . . . . . . . . . . . . . . . . . 12
-
- 3.2 Framing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
-
- 3.3 Version Identification . . . . . . . . . . . . . . . . . . . . . . 14
-
- 3.4 Conference Identification. . . . . . . . . . . . . . . . . . . . . 14
-
- 3.4.1 Demultiplexing . . . . . . . . . . . . . . . . . . . . . . . 15
-
- 3.4.2 Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . 15
-
- 3.5 Media Encoding Identification. . . . . . . . . . . . . . . . . . . 16
-
- 3.5.1 Audio Encodings . . . . . . . . . . . . . . . . . . . . . . . 17
-
- 3.5.2 Video Encodings . . . . . . . . . . . . . . . . . . . . . . . 19
-
- 3.6 Playout Synchronization. . . . . . . . . . . . . . . . . . . . . 19
-
- 3.6.1 Synchronization Methods . . . . . . . . . . . . . . . . . . . 21
-
- 3.6.2 Detection of Synchronization Units. . . . . . . . . . . . . . 22
-
- 3.6.3 Interpretation of Synchronization Bit . . . . . . . . . . . . 24
-
- 3.6.4 Interpretation of Timestamp . . . . . . . . . . . . . . . . . 25
-
- 3.6.5 End-of-talkspurt indication . . . . . . . . . . . . . . . . . 29
-
- 3.6.6 Recommendation. . . . . . . . . . . . . . . . . . . . . . . . 30
-
- 3.7 Segmentation and Reassembly. . . . . . . . . . . . . . . . . . . . 30
-
- 3.8 Source Identification. . . . . . . . . . . . . . . . . . . . . . . 31
-
-
- H. Schulzrinne Expires 03/01/94 [Page 2]
- INTERNET-DRAFT draft-ietf-avt-issues-01.txt October 20, 1993
-
- 3.8.1 Bridges, Translators and End Systems . . . . . . . . . . . . 31
-
- 3.8.2 Address Format Issues . . . . . . . . . . . . . . . . . . . . 33
-
- 3.8.3 Globally unique identifiers . . . . . . . . . . . . . . . . . 34
-
- 3.8.4 Locally unique addresses. . . . . . . . . . . . . . . . . . . 35
-
- 3.9 Energy Indication. . . . . . . . . . . . . . . . . . . . . . . . 37
-
- 3.10 Error Control . . . . . . . . . . . . . . . . . . . . . . . . . 37
-
- 3.11 Security and Privacy. . . . . . . . . . . . . . . . . . . . . . 39
-
- 3.11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . 39
-
- 3.11.2 Confidentiality. . . . . . . . . . . . . . . . . . . . . . . 40
-
- 3.11.3 Message Integrity and Authentication . . . . . . . . . . . . 41
-
- 3.12 Security for RTP vs. PEM. . . . . . . . . . . . . . . . . . . . 42
-
- 3.13 Quality of Service Control. . . . . . . . . . . . . . . . . . . 44
-
- 3.13.1 QOS Measures . . . . . . . . . . . . . . . . . . . . . . . . 44
-
- 3.13.2 Remote measurements. . . . . . . . . . . . . . . . . . . . . 45
-
- 3.13.3 Monitoring by Third Party . . . . . . . . . . . . . . . . . . 46
-
- 4 Conference Control Protocol 46
-
- 5 The Use of Profiles 46
-
- 6 Port Assignment 47
-
- 7 Multicast Address Allocation 48
-
- 7.1 Channel Sensing. . . . . . . . . . . . . . . . . . . . . . . . . . 49
-
- 7.2 Global Reservation Channel with Scoping. . . . . . . . . . . . . . 50
-
- 7.3 Local Reservation Channel. . . . . . . . . . . . . . . . . . . . . 50
-
- 7.3.1 Hierarchical Allocation with Servers . . . . . . . . . . . . 51
-
- 7.3.2 Distributed Hierarchical Allocation . . . . . . . . . . . . . 51
-
- 7.4 Restricting Scope by Limiting Time-to-Live . . . . . . . . . . . . 52
-
- H. Schulzrinne Expires 03/01/94 [Page 3]
- INTERNET-DRAFT draft-ietf-avt-issues-01.txt October 20, 1993
-
- 8 Security Considerations 52
-
- A Glossary 52
-
- B Address of Author 62
-
-
- 1 Introduction
-
-
- The transport protocol for real-time applications (RTP) discussed in this
- memorandum aims to provide services commonly required by interactive
- multimedia conferences, such as playout synchronization, demultiplexing,
- media identification and active-party identification. However, RTP is not
- restricted to multimedia conferences; it is anticipated that other real-time
- services such as remote data acquisition and control may find its services
- of use.
-
- In this context, a conference describes associations that are characterized
- by the participation of two or more agents, interacting in real time
- with one or more media of potentially different types. The agents are
- anticipated to be human, but may also be measurement devices, remote media
- servers, simulators and the like. Both two-party and multiple-party
- associations are to be supported, where one or more agents can take active
- roles, i.e., generate data. Thus, applications not commonly considered a
- conference fall under this wider definition, for example, one-way media such
- as the network equivalent of closed-circuit television or radio, traditional
- two-party telephone conversations or real-time distributed simulations.
- Even though intended for real-time interactive applications, the use of
- RTP for the storage and transmission of recorded real-time data should be
- possible, with the understanding that the interpretation of some fields such
- as timestamps may be affected by this off-line mode of operation.
-
- RTP uses the services of an end-to-end transport protocol such as UDP,
- TCP, OSI TP1 or TP4, ST-II or the like(1) . The services used are:
- end-to-end delivery, framing, demultiplexing and multicast. The underlying
- network is not assumed to be reliable and can be expected to lose, corrupt,
- arbitrarily delay and reorder packets. However, the use of RTP within
- ------------------------------
- 1. ST-II is not properly a transport protocol, as it is visible to
- intermediate nodes, but it provides services such as process demultiplexing
- commonly associated with transport protocols.
-
-
- H. Schulzrinne Expires 03/01/94 [Page 4]
- INTERNET-DRAFT draft-ietf-avt-issues-01.txt October 20, 1993
-
- quality-of-service (e.g., rate) controlled networks is anticipated to be of
- particular interest. Network layer support for multicasting is desirable,
- but not required. RTP is supported by a real-time control protocol (RTCP)
- in a relationship similar to that between IP and ICMP. However, RTP can be
- used, with reduced functionality, without a control protocol. The control
- protocol RTCP provides minimum functionality for maintaining conference
- state for one or more flows within a single transport association. RTCP
- is not guaranteed to be reliable; each participant simply sends the local
- information periodically to all other conference participants.
-
- As an alternative, RTP could be used as a transport protocol layered
- directly on top of IP, potentially increasing performance and reducing
- header overhead. This may be attractive as the services provided by UDP,
- checksumming and demultiplexing, may not be needed for multicast real-time
- conferencing applications. This aspect remains for further study. The
- relationships between RTP and RTCP to other protocols of the Internet
- protocol suite are depicted in Fig. 1.
-
- +--------------------------+-----------------------------+
- |                          |    conference controller    |
- |    media application     |-------------------+         |
- |                          | conf. ctl. prot.  |         |
- +--------------------------+-------------------+---------+
- |                          |       RTCP        |         |
- |                          +-------------------+         |
- |                           RTP                          |
- +--------+-----------------+                             |
- |        |       UDP       |                             |
- | ST-II  +-----------------+-------------+               |
- |        |              IP               |               |
- +--------------------------------------------------------+
- |                          AAL5                          |
- +--------------------------------------------------------+
- Figure 1: Embedding of RTP and RTCP in Internet protocol stack
-
- Conferences encompassing several media are managed by a (reliable)
- conference control protocol, whose definition is outside the scope of this
- note. Some aspects of its functionality, however, are described in
- Section 4.
-
- Within this working group, some common encoding rules and algorithms for
- media have been specified, keeping in mind that this aspect is largely
- independent of the remainder of the protocol. Without this specification,
- interoperability cannot be achieved. It is intended, however, to keep
- the two aspects as separate RFCs, since changes in media encoding should
- be independent of the transport aspects. The encoding specification
- includes issues such as byte order for multi-byte samples, sample order
- for multi-channel audio, the format of state information for differential
- encodings, the segmentation of encoded video frames into packets, and the
-
-
- H. Schulzrinne Expires 03/01/94 [Page 5]
- INTERNET-DRAFT draft-ietf-avt-issues-01.txt October 20, 1993
-
- like.
-
- When used for multimedia services, RTP sources will have to be able to
- convey the type of media encoding used to the receivers. The number
- of encodings potentially used is rather large, but a single application
- will likely restrict itself to a small subset of that. To allow the
- participants in conferences to unambiguously communicate to each other the
- current encoding, the working group is defining a set of encoding names to
- be registered with the Internet Assigned Numbers Authority (IANA). Also,
- short integers for a default mapping of common encodings are specified.
-
- The issue of port assignment will be discussed in more detail in Section 6.
- It should be emphasized, however, that UDP port assignment does not imply
- that all underlying transport mechanisms share this or a similar port
- mechanism.
-
- This memorandum aims to summarize some of the discussions held within the
- audio-video transport (AVT) working group chaired by Stephen Casner, but
- the opinions are the author's own. Where possible, references to previous
- work are included, but the author realizes that the attribution of ideas is
- far from complete. The memorandum builds on operational experience with
- Van Jacobson's and Steve McCanne's vat audio conferencing tool as well as
- implementation experience with the author's Nevot network voice terminal.
- This note will frequently refer to NVP [1], the network voice protocol,
- a protocol used in two versions for early Internet wide-area packet voice
- experiments. CCITT has standardized as recommendations G.764 and G.765
- a packet voice protocol stack for use in digital circuit multiplication
- equipment.
-
- The name RTP was chosen to reflect the fact that audio and video
- conferences may not be the only applications employing its services, while
- the real-time nature of the protocol is important, setting it apart from
- other multimedia-transport mechanisms, such as the MIME multimedia mail
- effort [2].
-
- The remainder of this memorandum is organized as follows. Section 2
- summarizes the design goals of this real-time transport protocol. Then,
- Section 3 describes the services to be provided in more detail. Section 4
- briefly outlines some of the services added by a higher-layer conference
- control protocol; a more detailed description is outside the scope of
- this document. Sections 6 and 7 discuss the issues of port assignment and
- multicast address allocation, respectively. A glossary defines terms and
- acronyms, providing references for further detail. The actual protocol
- specification embodying the recommendation and conclusions of this report is
- contained in a separate document.
-
-
-
-
-
-
-
- H. Schulzrinne Expires 03/01/94 [Page 6]
- INTERNET-DRAFT draft-ietf-avt-issues-01.txt October 20, 1993
-
- 2 Goals
-
-
- Design decisions should be measured against the following goals, not
- necessarily listed in order of importance:
-
-
- content flexibility: While the primary applications that motivate the
- protocol design are conference voice and video, it should be
- anticipated that other applications may also find the services
- provided by the protocol useful. Some examples include distribution
- audio/video (for example, the ``Radio Free Ethernet'' application by
- Sun), distributed simulation and some forms of (loss-tolerant) remote
- data acquisition (for example, active badge systems [3,4]). Note that
- it is possible that the same packet header field may be interpreted in
- different ways depending on the content (e.g., a synchronization bit
- may be used to indicate the beginning of a talkspurt for audio and the
- beginning of a frame for video). Also, new formats of established
- media, for example, high-quality multi-channel audio or combined audio
- and video sources, should be anticipated where possible.
-
- extensible: Researchers and implementors within the Internet community are
- currently only beginning to explore real-time multimedia services such
- as video conferences. Thus, the RTP should be able to incorporate
- additional services as operational experience with the protocol
- accumulates and as applications not originally anticipated find its
- services useful. The same mechanisms should also allow experimental
- applications to exchange application-specific information without
- jeopardizing interoperability with other applications. Extensibility
- is also desirable as it will hopefully speed along the standardization
- effort, making the consequences of leaving out some group's favorite
- fixed header field less drastic.
-
- It should be understood that extensibility and flexibility may conflict
- with the goals of bandwidth and processing efficiency.
-
- independent of lower-layer protocols: RTP should make as few assumptions
- about the underlying transport protocol as possible. It should, for
- example, work reasonably well with UDP, TCP, ST-II, OSI TP, VMTP and
- experimental protocols, for example, protocols that support resource
- reservation and quality-of-service guarantees. Naturally, not all
- transport protocols are equally suited for real-time services; in
- particular, TCP may introduce unacceptable delays over anything but
- low-error-rate LANs. Also, protocols that deliver streams rather than
- packets need additional framing services as discussed in Section 3.2.
-
- It remains to be discussed whether RTP may use services provided by the
- lower-layer protocols for its own purposes (time stamps and sequence
- numbers, for example).
-
-
-
- H. Schulzrinne Expires 03/01/94 [Page 7]
- INTERNET-DRAFT draft-ietf-avt-issues-01.txt October 20, 1993
-
- The goal of independence from lower-layer considerations also affects
- the issue of address representation. In particular, anything too
- closely tied to the current IP 4-byte addresses may face early
- obsolescence. It is to be anticipated, however, that experience gained
- will suggest a new protocol revision in any event by that time.
-
- bridge-compatible: Operational experience has shown that RTP-level bridges
- are necessary and desirable for a number of reasons. First, it
- may be desirable to aggregate several media streams into a single
- stream and then retransmit it with possibly different encoding, packet
- size or transport protocol. A packet ``translator'' that achieves
- multicasting by user-level copying may be needed where multicast
- tunnels or IP connectivity are unavailable or the end-systems are not
- multicast-capable.
-
- bandwidth efficient: It is anticipated that the protocol will be used in
- networks with a wide range of bandwidths and with a variety of media
- encodings. Despite increasing bandwidths within the national backbone
- networks, bandwidth efficiency will continue to be important for
- transporting conferences across 56 kb/s links, office-to-home high-speed
- modem connections and international links. To minimize end-to-end
- delay and the effect of lost packets, packetization intervals have to
- be limited, which, in combination with efficient media encodings, leads
- to short packet sizes. Generally, packets containing 16 to 32 ms of
- speech are considered optimal [5-7]. For example, even with a 65 ms
- packetization interval, a 4800 b/s encoding produces 39 byte packets.
- Current Internet voice experiments use packets containing around 20 ms
- of audio, which translates into 160 bytes of audio information coded
- at 64 kb/s. Video packets are typically much longer, so that header
- overhead is less of a concern.
-
- For UDP multicast (without counting the overhead of source routing as
- currently used in tunnels or a separate IP encapsulation as planned),
- IPv4 incurs 20 bytes and UDP an additional 8 bytes of header overhead,
- to which datalink layer headers of at least 4 bytes must be added.
- With RTP header lengths between 4 and 8 bytes, the total overhead
- amounts to between 36 and 40 (or more) bytes per audio or video packet.
- For 160-byte audio packets, the overhead of 8-byte RTP headers together
- with UDP, IP and PPP (as an example of a datalink protocol) headers is
- 25%. For low bitrate coding, packet headers can easily double the
- necessary bit rate. (A small worked example of these overhead figures
- follows this list of goals.)
-
- Thus, it appears that any fixed headers beyond eight bytes would have
- to make a significant contribution to the protocol's capabilities as
- such long headers could stand in the way of running RTP applications
- over low-speed links. The current fixed header lengths for NVP and
- vat are 4 and 8 bytes, respectively. It is interesting to note that
- G.764 has a total header overhead, including the LAPD data link layer,
- of only 8 bytes, as the voice transport is considered a network-layer
- protocol. The overhead is split evenly between layers 2 and 3.
-
-
- H. Schulzrinne Expires 03/01/94 [Page 8]
- INTERNET-DRAFT draft-ietf-avt-issues-01.txt October 20, 1993
-
- Bandwidth efficiency can be achieved by transporting non-essential or
- slowly changing protocol state in optional fields or in a separate
- low-bandwidth control protocol. Also, header compression [8] may be
- used.
-
- international: Even now, audio and video conferencing tools are used far
- beyond the North American continent. It would seem appropriate to give
- considerations to internationalization concerns, for example to allow
- for the European A-law audio companding and non-US-ASCII character sets
- in textual data such as site identification.
-
- processing efficient: With arrival rates on the order of 40 to 50
- packets per second for a single voice or video source, per-packet
- processing overhead may become a concern, particularly if the
- protocol is to be implemented on other than high-end workstations.
- Multiplication and division operations should be avoided where possible
- and fields should be aligned to their natural size, i.e., an n-byte
- integer is aligned on an n-byte multiple, where possible.
-
- implementable now: Given the anticipated lifetime and experimental nature
- of the protocol, it must be implementable with current hardware and
- operating systems. That does not preclude that hardware and operating
- systems geared towards real-time services may improve the performance
- or capabilities of the protocol, e.g., allow better intermedia
- synchronization.
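-
- To make the overhead figures in the bandwidth-efficiency goal above
- concrete, the following small worked example recomputes them in C. The
- header sizes are the ones quoted there (20-byte IPv4, 8-byte UDP, an
- 8-byte RTP-style header and 4 bytes of PPP framing); the codec bit rates
- and the 20 ms packetization interval are merely illustrative choices, not
- recommendations.
-
-     /* Worked example: per-packet header overhead for short audio packets.
-      * Header sizes follow the discussion of the bandwidth-efficiency goal;
-      * the codec rates and packetization interval are illustrative only. */
-     #include <stdio.h>
-
-     int main(void)
-     {
-         const int header_bytes = 20 + 8 + 8 + 4;  /* IP + UDP + RTP + PPP */
-         const double codec_kbps[] = { 64.0, 32.0, 4.8 };
-         const double packet_ms = 20.0;
-         size_t i;
-
-         for (i = 0; i < sizeof(codec_kbps) / sizeof(codec_kbps[0]); i++) {
-             double payload = codec_kbps[i] * packet_ms / 8.0;  /* bytes */
-             printf("%4.1f kb/s: %5.1f byte payload, %3.0f%% header overhead\n",
-                    codec_kbps[i], payload, 100.0 * header_bytes / payload);
-         }
-         return 0;
-     }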
-
-
- 3 Services
-
-
- The services that may be provided by RTP are summarized below. Note that
- not all services have to be offered. Services anticipated to be optional
- are marked with an asterisk.
-
-
- o framing (*)
-
- o demultiplexing by conference/association (*)
-
- o demultiplexing by media source
-
- o demultiplexing by conference
-
- o determination of media encoding
-
- o playout synchronization between a source and a set of destinations
-
- o error detection (*)
-
- o encryption (*)
-
- H. Schulzrinne Expires 03/01/94 [Page 9]
- INTERNET-DRAFT draft-ietf-avt-issues-01.txt October 20, 1993
-
- o quality-of-service monitoring (*)
-
-
- In the following sections, we will discuss how these services are reflected
- in the proposed packet header. Information to be conveyed within the
- conference can be roughly divided into information that changes with every
- data packet and other information that stays constant for longer time
- periods. State information that does not change with every packet can be
- carried in several different ways:
-
-
- as a fixed part of the RTP header: This method is easiest to decode and
- ensures state synchronization between sender and receiver(s), but can
- be bandwidth inefficient or restrict the amount of state information to
- be conveyed.
-
- as a header option: The information is only carried when needed. It
- requires more processing by the sending and receiving application. If
- contained in every packet, it is also less bandwidth-efficient than the
- first method.
-
- within RTCP packets: This approach is roughly equivalent to header options
- in terms of processing and bandwidth efficiency. Some means of
- identifying when a particular option takes effect within the data
- stream may have to be provided.
-
- within a multicast conference announcement: Instead of residing at a well-
- known conference server, information about on-going or upcoming
- conferences may be multicast to a well-known multicast address.
-
- within conference control: The state information is conveyed when the
- conference is established or when the information changes. As for RTCP
- packets, a synchronization mechanism between data and control may be
- required for certain information.
-
- through a conference directory: This is a variant of the conference control
- mechanism, with a (distributed) directory at a well-known (multicast)
- address maintaining state information about on-going or scheduled
- conferences. Changing state information during a conference is
- probably more difficult than with conference control as participants
- need to be told to look at the directory for changed information.
- Thus, a directory is probably best suited to hold information that will
- persist through the life of the conference, for example, its multicast
- group, list of media encodings, title and organizer.
-
-
- The first two methods are examples of in-band signaling, the others of
- out-of-band signaling.
-
- Options can be encoded in a number of ways, resulting in different tradeoffs
- between flexibility, processing overhead and space requirements. In
-
- H. Schulzrinne Expires 03/01/94 [Page 10]
- INTERNET-DRAFT draft-ietf-avt-issues-01.txt October 20, 1993
-
- general, options consist of a type field, possibly a length field, and
- the actual option value. The length field can be omitted if the length
- is implied by the option type. Implied-length options save space, but
- require special treatment while processing. While options with explicit
- length that are added in later protocol versions are backwards-compatible
- (the receiver can just skip them), implied-length options cannot be added
- without modifying all receivers, unless they are marked as such and all have
- a known length. As an example, IP defines two implied-length options, no-op
- and end-of-option, both with a length of one octet. Both CLNP and IP follow
- the type-length-data model, with different substructure of the type field.
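-
- To illustrate the type-length-value model just described, the sketch below
- walks an option list that mixes two single-octet implied-length options (in
- the style of IP's no-op and end-of-option) with explicit-length options.
- The option type values and the sample buffer are hypothetical; they are not
- taken from any RTP specification.
-
-     /* Sketch of walking a type-length-value option list with two
-      * single-octet implied-length options.  Types and layout are
-      * hypothetical, for illustration only. */
-     #include <stddef.h>
-     #include <stdio.h>
-
-     #define OPT_END 0        /* implied length 1: terminates the list */
-     #define OPT_NOP 1        /* implied length 1: padding */
-
-     /* Returns option bytes consumed, or -1 on a malformed list. */
-     static int walk_options(const unsigned char *opt, size_t len)
-     {
-         size_t off = 0;
-
-         while (off < len) {
-             unsigned char type = opt[off];
-             unsigned char olen;
-
-             if (type == OPT_END)
-                 return (int)(off + 1);
-             if (type == OPT_NOP) {
-                 off++;
-                 continue;
-             }
-             /* explicit length covers type, length and value octets */
-             if (off + 2 > len)
-                 return -1;
-             olen = opt[off + 1];
-             if (olen < 2 || off + olen > len)
-                 return -1;
-             printf("option type %d, %d value bytes\n", type, olen - 2);
-             off += olen;
-         }
-         return (int)off;
-     }
-
-     int main(void)
-     {
-         /* one no-op, one 4-byte option of (hypothetical) type 7, then end */
-         unsigned char buf[] = { OPT_NOP, 7, 4, 0xab, 0xcd, OPT_END };
-
-         printf("consumed %d of %u bytes\n",
-                walk_options(buf, sizeof buf), (unsigned)sizeof buf);
-         return 0;
-     }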
-
- For indicating the extent of options, a number of alternatives have been
- suggested.
-
-
- option length: The fixed header contains a field giving the length of
- the options, as used for IP. This makes skipping over options easy, but
- consumes precious header space.
-
- end-of-options bit: Each option contains a special bit that is set only for
- the last option in the list. In addition, the fixed header contains
- a flag indicating that options are present. This conserves space
- in the fixed header, at the expense of reducing usable space within
- options, e.g., reducing the number of possible option types or the
- maximum option length. It also makes skipping options somewhat more
- processing-intensive, particularly if some options have implied lengths
- and others have explicit lengths. Skipping through the options list
- can be accelerated slightly by starting options with a length field.
-
- end-of-options option: A special option type indicates the end of the
- option list, with a bit in the fixed header indicating the presence of
- options. The properties of this approach are similar to the previous
- one, except that it can be expected to take up more header space.
-
- options directory: An options-present bit in the fixed header indicates
- the presence of an options directory. The options directory in
- turn contains a length field for the options list and possibly bits
- indicating the presence of certain options or option classes. The
- option length makes skipping options fast, while the presence bits
- allow a quick decision whether the options list should be scanned for
- relevant options. If all options have a known, fixed length, the bit
- mask can be used to directly access certain options, without having
- to traverse parts of the options list. The drawback is increased
- header space and the necessity to create the directory. If options are
- explicitly coded in the bit mask, the type, number and numbering of
- options is restricted. This approach is used by PIP [9].
-
-
-
-
-
-
- H. Schulzrinne Expires 03/01/94 [Page 11]
- INTERNET-DRAFT draft-ietf-avt-issues-01.txt October 20, 1993
-
- 3.1 Duplex or Simplex?
-
-
- In terms of information flow, protocols can be roughly divided into three
- categories:
-
-
- 1. For one instance of a protocol, packets travel only in one direction;
- i.e., the receiver has no way to directly influence the sender. UDP is
- an example of such a protocol.
-
- 2. While data only travels in one direction, the receiver can send back
- control packets, for example, to accept or reject a connection, or
- request retransmission. ST-II in its standard simplex mode is an
- example; TCP is symmetric (see next item), but during a file transfer,
- it typically operates in this mode, where one side sends data and the
- receiver of the data returns acknowledgements.
-
- 3. The protocol is fully symmetric during the data transfer phase, with
- user data and control information travelling in both directions. TCP
- is a symmetric protocol.
-
-
- Note that bidirectional data flow can usually be simulated by two or more
- one-directional data flows in opposite directions; however, if the data
- sinks need to transmit control information to the source, a decoupled stream
- in the reverse direction will not do without additional machinery to bridge
- the gap between the two protocol state machines.
-
- For most of the anticipated applications for a real-time transport
- protocol, one-directional data flow appears sufficient. Also, in general,
- bidirectional flows may be difficult to maintain in one-to-many settings
- commonly found in conferences. Real-time requirements combined with
- network latency make achieving reliability through retransmission difficult,
- eliminating another reason for a bidirectional communication channel. Thus,
- we will focus only on control flow from the receiver of a data flow to its
- sender. For brevity, we will refer to packets of this control flow as
- reverse control packets.
-
- There are at least two areas within multimedia conferences where a receiver
- needs to communicate control information back to the source. First, the
- sender may want or need to know how well the transmission is proceeding,
- as traditional feedback through acknowledgements is missing (and usually
- infeasible due to acknowledgment implosion). Secondly, the receiver should
- be able to request a selective update of its state, for example, to obtain
- missing image blocks after joining an on-going conference. Note that for
- both uses, unicast rather than multicast is appropriate.
-
- Three approaches allowing the sender to distinguish reverse control packets
- from data packets are compared here:
-
-
- H. Schulzrinne Expires 03/01/94 [Page 12]
- INTERNET-DRAFT draft-ietf-avt-issues-01.txt October 20, 1993
-
- sender port equals reverse port, marked packet: The same port number is
- used both for data and return control messages. Packets then have to
- be marked to allow distinguishing the two. Either the presence of
- certain options would indicate a reverse control packet, or the options
- themselves would be interpreted as reverse control information, with
- the rest of the packet treated as regular data. The latter approach
- appears to be the most flexible and symmetric, and is similar in
- spirit to transport protocols with piggy-backed acknowledgements as in
- TCP. Also, since several conferences with different multicast addresses
- may be using the same port number, the receiver has to include the
- multicast address in its reverse control messages. As a final
- identification, the control packets have to bear the flow identifier
- they belong to. The scheme has the grave disadvantage that every
- application on a host has to receive the reverse control messages and
- decide whether they involve a flow it is responsible for.
-
- single reverse port: Reverse control packets for all flows use a single
- port that differs from the data port. Since the type of the packet
- (control vs. data) is identified by the port number, only the
- multicast address and flow number still need to be included, without a
- need for a distinguishing packet format. Adding a port means that port
- negotiation is somewhat more complicated; also, as in the first scheme,
- the application still has to demultiplex incoming control messages.
-
- different reverse port for each flow: This method requires that each source
- make it known to all receivers on which port it wishes to receive
- reverse control messages. Demultiplexing based on flow and multicast
- address is no longer necessary. However, each participant sending
- data and expecting return control messages has to communicate the port
- number to all other participants. Since the reverse control port
- number should remain constant throughout the conference (except after
- application restarts), a periodic dissemination of that information is
- sufficient. Distributing the port information has the advantage that
- it gives applications the flexibility to designate only certain flows
- as potential recipients of reverse control information.
-
- Unfortunately, the delay in acquiring the reverse control port number
- when joining an on-going conference may make one of the more
- interesting uses of a reverse control channel difficult to implement,
- namely the request by a new arrival to the sender to transmit the
- complete current state (e.g., image) rather than changes only.
-
-
- 3.2 Framing
-
-
- To satisfy the goal of transport independence, we cannot assume that the
- lower layer provides framing. (Consider TCP as an example; it would
- probably not be used for real-time applications except possibly on a local
- network, but it may be useful in distributing recorded audio or video
- segments.) It may also be desirable to pack several RTPDUs into a single
-
- H. Schulzrinne Expires 03/01/94 [Page 13]
- INTERNET-DRAFT draft-ietf-avt-issues-01.txt October 20, 1993
-
- TPDU.
-
- The obvious solution is to provide for an optional message length prefixed
- to the actual packet. If the underlying protocol does not message
- delineation, both sender and receiver would know to use the message length.
- If used to carry multiple RTPDUs, all participants would have to arrive
- at a mutual agreement as to its use. A 16-bit field should cover most
- needs, but appears to break the 4-byte alignment for the rest of the header.
- However, an application would read the message length first and then copy
- the appropriate number of bytes into a buffer, suitably aligned.
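-
- A minimal sketch of the length-prefix idea follows, assuming a 16-bit
- length in network byte order in front of each RTPDU carried in a byte
- stream. The field size and its placement are assumptions made for
- illustration, not a defined RTP framing format.
-
-     /* Sketch: marking RTPDU boundaries in a byte stream (e.g., over TCP)
-      * with a 16-bit big-endian length prefix.  The layout is illustrative. */
-     #include <stdint.h>
-     #include <stdio.h>
-     #include <string.h>
-
-     /* Prepend a 16-bit big-endian length to a PDU; returns bytes written. */
-     static size_t frame_pdu(unsigned char *out, const void *pdu, uint16_t len)
-     {
-         out[0] = (unsigned char)(len >> 8);
-         out[1] = (unsigned char)(len & 0xff);
-         memcpy(out + 2, pdu, len);
-         return (size_t)len + 2;
-     }
-
-     /* Extract the next PDU from the buffered stream; returns its length and
-      * advances *off past it, or -1 if no complete PDU is available yet. */
-     static int next_pdu(const unsigned char *stream, size_t avail,
-                         size_t *off, const unsigned char **pdu)
-     {
-         uint16_t len;
-
-         if (avail - *off < 2)
-             return -1;
-         len = (uint16_t)((stream[*off] << 8) | stream[*off + 1]);
-         if (avail - *off - 2 < len)
-             return -1;                    /* wait for more data */
-         *pdu = stream + *off + 2;
-         *off += (size_t)len + 2;
-         return len;
-     }
-
-     int main(void)
-     {
-         unsigned char stream[64];
-         size_t used = 0, off = 0;
-         const unsigned char *pdu;
-         int len;
-
-         /* pack two small dummy RTPDUs back to back, as on a TCP connection */
-         used += frame_pdu(stream + used, "pdu-one", 7);
-         used += frame_pdu(stream + used, "pdu2", 4);
-
-         while ((len = next_pdu(stream, used, &off, &pdu)) >= 0)
-             printf("got PDU of %d bytes: %.*s\n", len, len, (const char *)pdu);
-         return 0;
-     }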
-
-
- 3.3 Version Identification
-
-
- Humility suggests that we anticipate that we may not get the first iteration
- of the protocol right. In order to avoid ``flag days'' where everybody
- shifts to a new protocol, a version identifier could ensure continued
- interoperability. Alternatively, a new port could be used, as long as only
- one port (or at most a few ports) is used for all media. The difficulty in
- interworking between the current vat and NVP protocols further affirms the
- desirability of a version identifier. However, the version identifier can
- be anticipated to be the most static of all proposed header fields. Since
- the length of the header and the location and meaning of the option length
- field may be affected by a version change, encoding the version within an
- optional field is not feasible.
-
- Putting the version number into the control protocol packets would make RTCP
- mandatory and would make rapid scanning of conferences significantly more
- difficult.
-
- vat currently offers a 2-bit version field, while this capability is missing
- from NVP. Given the low bit usage and their utility in other contexts (IP,
- ST-II), it may be prudent to include a version identifier. To be useful,
- any version field must be placed at the very beginning of the header.
- Assigning an initial version value of one to RTP allows interoperability
- with the current vat protocol.
-
-
- 3.4 Conference Identification
-
-
- A conference identifier (conference ID) could serve two mutually exclusive
- functions: providing another level of demultiplexing or a means of
- logically aggregating flows with different network addresses and port
- numbers. vat specifies a 16-bit conference identifier.
-
-
-
-
-
-
- H. Schulzrinne Expires 03/01/94 [Page 14]
- INTERNET-DRAFT draft-ietf-avt-issues-01.txt October 20, 1993
-
- 3.4.1 Demultiplexing
-
-
- Demultiplexing by RTP allows one association characterized by destination
- address and port number to carry several distinct conferences. However,
- this appears to be necessary only if the number of conferences exceeds the
- demultiplexing capability available through (multicast) addresses and port
- numbers.
-
- Efficiency arguments suggest that combining several conferences or media
- within a single multicast group is not desirable: it reduces the bandwidth
- efficiency afforded by multicasting if the sets of destinations are
- different. Also, applications that are not interested in a particular
- conference or not capable of dealing with a particular medium are still
- forced to handle the packets delivered for that conference or medium.
- Consider as an
- example two separate applications, one for audio, one for video. If both
- share the same multicast address and port, being differentiated only by the
- conference identifier, the operating system has to copy each incoming audio
- and video packet into two application buffers and perform a context switch
- to both applications, only to have one immediately discard the incoming
- packet.
-
- Given that application-layer demultiplexing has strong negative efficiency
- implications and given that multicast addresses are not an extremely
- scarce commodity, there seems to be no reason to burden every application
- with maintaining and checking conference identifiers for the purpose of
- demultiplexing. However, if this protocol is to be used as a transport
- protocol, demultiplexing capability is required.
-
- It is also not recommended to use a conference identifier to distinguish
- between different encodings, as it would be difficult for the application
- to decide whether a new conference identifier means that a new conference
- has arrived or simply that all participants should be moved to the new
- conference with a different encoding. Since the encoding may change for some
- but not all participants, we could find ourselves breaking a single logical
- conference into several pieces, with a fairly elaborate control mechanism to
- decide which conferences logically belong together.
-
-
- 3.4.2 Aggregation
-
-
- Particularly within a network with a wide range of capacities, using
- different multicast groups for each media component of a conference allows
- the media distribution to be tailored to the network bandwidths and
- end-system capabilities. It appears useful, however, to have a means of
- identifying groups that logically belong together, for example for purposes
- of time synchronization.
-
- A conference identifier used in this manner would have to be globally
-
- H. Schulzrinne Expires 03/01/94 [Page 15]
- INTERNET-DRAFT draft-ietf-avt-issues-01.txt October 20, 1993
-
- unique. It appears that such logical connections would better be identified
- as part of the higher-layer control protocol by identifying all multicast
- addresses belonging to the same logical conference, thereby avoiding the
- assignment of globally unique identifiers.
-
-
- 3.5 Media Encoding Identification
-
-
- This field plays a similar role to the protocol field in data link
- or network protocols, indicating the next higher layer (here, the media
- decoder) that the data is meant for. For RTP, this field would indicate the
- audio or video or other media encoding. In general, the number of distinct
- encodings should be kept as small as possible to increase the chance that
- applications can interoperate. A new encoding should only be recognized
- if it significantly enhances the range of media quality or the types of
- networks conferences can be conducted over. The unnecessary proliferation
- of encodings can be reduced by making reference implementations of standard
- encoders and decoders widely available.
-
- It should be noted that encodings may not be enumerable as easily as, say,
- transport protocols. A particular family of related encoding methods may
- be described by a set of parameters, as discussed below in the sections on
- audio and video encoding.
-
- Encodings may change during the duration of a conference. This may be
- due to changed network conditions, changed user preference or because the
- conference is joined by a new participant that cannot decode the current
- encoding. If the information necessary for the decoder is conveyed
- out-of-band, some means of indicating when the change is effective needs to
- be incorporated. Also, the indication that the encoding is about to change
- must reach all receivers reliably before the first packet employing the new
- encoding. Each receiver needs to track pending changes of encodings and
- check for every incoming packet whether an encoding change is to take effect
- with this packet.
-
- Conveying media encodings rapidly is also important to allow scanning of
- conferences or broadcast media. Note that it is not necessary to convey
- the whole encoder description, with all parameters; an index into a table of
- well-known encodings is probably preferable. An index would also make it
- easier to detect whether the encoding has changed.
-
- Alternatively, a directory or announcement service could provide encoding
- information for on-going conferences, without carrying the information in
- every packet. This may not be sufficient, however, unless all participants
- within a conference use the same encoding. As soon as the encoding
- information is separated from the media data, a synchronization mechanism
- has to be devised that ensures that sender and receiver interpret the data
- in the same manner after the out-of-band information has been updated.
-
- There are at least two approaches to indicating media encoding, either
-
- H. Schulzrinne Expires 03/01/94 [Page 16]
- INTERNET-DRAFT draft-ietf-avt-issues-01.txt October 20, 1993
-
- in-band or out-of-band:
-
-
- conference-specific: Here, the media identifier is an index into a table
- designating the approved or anticipated encodings (together with any
- particular version numbers or other parameters) for a particular
- conference or user community. The table can be distributed
- through RTCP, a higher-layer conference control protocol, a conference
- announcement service or some other out-of-band means. Since the number
- of encodings used during a single conference is likely to be small, the
- field width in the header can likewise be small. Also, there is no
- need to agree on an Internet-wide list of encodings. It should be
- noted that conveying the table of encodings through RTCP forces the
- application to maintain a separate mapping table for each sender as
- there can be no guarantee that all senders will use the same table.
- Since the control protocol proposed here is unreliable, changing the
- meaning of encoding indices dynamically is fraught with possibilities
- for misinterpretation and lost data unless this mapping is carried in
- every packet.
-
- global: Here, the media identifier is an index into a global table
- of encodings. A global list reduces the need for out-of-band
- information. Transmitting the parameters associated with an encoding
- may be difficult, however, if it has to be done within the header space
- constraints of per-packet signaling.
-
-
- To make detecting coder mismatches easier, encodings for all media should
- be drawn from the same numbering space. To facilitate experimentation with
- new encodings, a part of any global encoding numbering space should be
- set aside for experimental encodings, with numbers agreed upon within the
- community experimenting with the encoding, with no Internet-wide guarantee
- of uniqueness.
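-
- As a sketch of the global-table alternative, a receiver might keep a small
- array that maps the encoding identifier carried in each packet to decoder
- parameters, as below. The identifier values and table entries are entirely
- hypothetical; they are not the registered numbers discussed above.
-
-     /* Sketch: mapping a small integer media-encoding identifier to decoder
-      * parameters.  The identifiers and entries are hypothetical. */
-     #include <stddef.h>
-     #include <stdio.h>
-
-     struct encoding {
-         int         id;           /* value carried in the packet header */
-         const char *name;
-         int         sample_rate;  /* Hz; 0 if not applicable */
-         int         channels;
-     };
-
-     static const struct encoding encoding_table[] = {
-         { 0, "mu-law PCM",   8000, 1 },
-         { 1, "A-law PCM",    8000, 1 },
-         { 2, "G.721 ADPCM",  8000, 1 },
-         { 7, "experimental",    0, 0 },   /* locally defined */
-     };
-
-     static const struct encoding *lookup_encoding(int id)
-     {
-         size_t i;
-
-         for (i = 0; i < sizeof(encoding_table) / sizeof(*encoding_table); i++)
-             if (encoding_table[i].id == id)
-                 return &encoding_table[i];
-         return NULL;              /* unknown encoding: discard or report */
-     }
-
-     int main(void)
-     {
-         const struct encoding *e = lookup_encoding(1);
-
-         if (e != NULL)
-             printf("encoding 1 -> %s, %d Hz, %d channel(s)\n",
-                    e->name, e->sample_rate, e->channels);
-         return 0;
-     }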
-
-
- 3.5.1 Audio Encodings
-
-
- Audio data is commonly characterized by three independent descriptors:
- encoding (the translation of one or more audio samples into a channel
- symbol), the number of channels (mono, stereo, ...) and the sampling rate.
-
- Theoretically, sampling rate and encoding are (largely) independent. We
- could, for example, apply mu-law encoding to any sampling rate even though
- it is traditionally used with a rate of 8,000 Hz. In practical terms, it
- may be desirable to limit the combinations of encoding and sampling rate to
- the values the encoding was designed for.(2) Channel counts between 1 and
- ------------------------------
- 2. Given the wide availability of mu-law encoding and its low overhead,
- using it with a sampling rate of 16,000 or 32,000 Hz might be quite
-
-
- H. Schulzrinne Expires 03/01/94 [Page 17]
- INTERNET-DRAFT draft-ietf-avt-issues-01.txt October 20, 1993
-
- 6 should be sufficient even for surround sound.
-
- The audio encodings listed in Table 1 appear particularly interesting,
- even though the list is by no means exhaustive and does not include some
- experimental encodings currently in use, for example a non-standard form of
- LPC. The bit rate is shown per channel. k samples/s, b/sample and kb/s
- denote kilosamples per second, bits per sample and kilobits per second,
- respectively. If sampling rates are to be specified separately, the values
- of 8, 16, 32, 44.1, and 48 kHz suggest themselves, even though other
- values (11.025 and 22.05 kHz) are supported on some workstations (the
- Silicon Graphics audio hardware and the Apple Macintosh, for example).
- Clearly, little is to be gained by allowing arbitrary sampling rates, as
- conversion particularly between rates not related by simple fractions is
- quite cumbersome and processing-intensive [10].
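-
- For the sample-based entries in the table below, the kb/s column is simply
- the sampling rate multiplied by the number of bits per sample. The
- following trivial cross-check recomputes a few of those rows; the selection
- of rows is arbitrary.
-
-     /* Cross-check of some sample-based audio encodings: per-channel bit
-      * rate equals sampling rate times bits per sample. */
-     #include <stdio.h>
-
-     int main(void)
-     {
-         struct { const char *name; double ksamples; int bits; } enc[] = {
-             { "G.711 (mu-law or A-law)",  8.0,  8 },
-             { "G.721 ADPCM",              8.0,  4 },
-             { "CD / DAT playback",       44.1, 16 },
-             { "DAT record",              48.0, 16 },
-         };
-         size_t i;
-
-         for (i = 0; i < sizeof(enc) / sizeof(enc[0]); i++)
-             printf("%-24s %5.1f k samples/s x %2d b/sample = %6.1f kb/s\n",
-                    enc[i].name, enc[i].ksamples, enc[i].bits,
-                    enc[i].ksamples * enc[i].bits);
-         return 0;
-     }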
-
-
- Org.      Name      k samples/s  b/sample  kb/s    description
- CCITT     G.711     8.0          8         64      mu-law PCM
- CCITT     G.711     8.0          8         64      A-law PCM
- CCITT     G.721     8.0          4         32      ADPCM
- Intel     DVI       8.0          4         32      ADPCM
- CCITT     G.723     8.0          3         24      ADPCM
- CCITT     G.726                                    ADPCM
- CCITT     G.727                                    ADPCM
- NIST/GSA  FS 1015   8.0                    2.4     LPC-10E
- NIST/GSA  FS 1016   8.0                    4.8     CELP
- NADC      IS-54     8.0                    7.95    N. American Digital Cellular, VSELP
- CCITT     G.728     8.0                    16      LD-CELP
- GSM                 8.0                    13      RPE-LTP
- CCITT     G.722     16.0                   64      7 kHz, SB-ADPCM
- ISO       11172-3                          256     MPEG audio
-                     32.0         16        512     DAT
-                     44.1         16        705.6   CD, DAT playback
-                     48.0         16        768     DAT record
-
-
- Table 1: Standardized and common audio encodings
- ------------------------------
- appropriate for high-quality audio conferences, even though there are other
- encodings, such as G.722, specifically designed for such applications. Note
- that the signal-to-noise ratio of mu-law encoding is about 38 dB, equivalent
- to an AM receiver. The ``telephone quality'' associated with G.711 is due
- primarily to the limitation in frequency response to the 200 to 3500 Hz
- range.
-
-
-
-
-
-
-
- H. Schulzrinne Expires 03/01/94 [Page 18]
- INTERNET-DRAFT draft-ietf-avt-issues-01.txt October 20, 1993
-
- 3.5.2 Video Encodings
-
-
- Common video encodings are listed in Table 2. Encodings with tunable rate
- can be configured for different rates, but produce a fixed-rate stream.
- The average bit rate produced by variable-rate codecs depends on the source
- material.
-
- Org.        name      rate                  remarks
- CCITT       JPEG      tunable
- CCITT       MPEG      variable, tunable
- CCITT       H.261     tunable, p x 64 kb/s
- Bolter                variable, tunable
- PictureTel            ??
- Cornell U.  CU-SeeMe  variable
- Xerox PARC  nv        variable, tunable
- BBN         DVC       variable, tunable     block differences
-
-
- Table 2: Common video encodings
-
-
- 3.6 Playout Synchronization
-
-
- A major purpose of RTP is to provide the support for various forms of
- synchronization, without necessarily performing the synchronization itself.
- We can distinguish three kinds of synchronization:
-
-
- playout synchronization: The receiver plays out the medium a fixed time
- after it was generated at the source (end-to-end delay). This
- end-to-end delay may vary from synchronization unit to synchronization
- unit. In other words, playout synchronization assures that a constant
- rate source at the sender again becomes a constant rate source at the
- receiver, despite delay jitter in the network.
-
- intra-media synchronization: All receivers play the same segment of a
- medium at the same time. Intra-media synchronization may be needed
- during simulations and wargaming.
-
- inter-media synchronization: The timing relationship between several media
- sources is reconstructed at the receiver. The primary example is
- the synchronization between audio and video (lip-sync). Note that
- different receivers may experience different delays between the media
- generation time and their playout time.
-
-
- Playout synchronization is required for most media, while intra-media and
- inter-media synchronization may or may not be implemented. In connection
-
- H. Schulzrinne Expires 03/01/94 [Page 19]
- INTERNET-DRAFT draft-ietf-avt-issues-01.txt October 20, 1993
-
- with playout synchronization, we can group packets into playout units, a
- number of which in turn form a synchronization unit. More specifically, we
- define:
-
-
- synchronization unit: A synchronization unit consists of one or more
- playout units (see below) that, as a group, share a common fixed delay
- between generation and playout of each part of the group. The delay
- may change at the beginning of such a synchronization unit. The most
- common synchronization units are talkspurts for voice and frames for
- video transmission.
-
- playout unit: A playout unit is a group of packets sharing a common
- timestamp. (Naturally, packets whose timestamps are identical due
- to timestamp wrap-around are not considered part of the same playout
- unit.) For voice, the playout unit would typically be a single voice
- segment, while for video a video frame could be broken down into
- subframes, each consisting of packets sharing the same timestamp and
- ordered by some form of sequence number.
-
-
- Two concepts related to synchronization and playout units are absolute and
- relative timing. Absolute timing maintains a fixed timing relationship
- between sender and receiver, while relative timing ensures that the spacing
- between packets at the sender is the same as that at the receiver, measured
- in terms of the sampling clock. Playout units within the synchronization
- unit maintain relative timing with respect to each other; absolute timing is
- undesirable if the receiver clock runs at a (slightly) different rate than
- the sender clock.
-
- Most proposed synchronization methods require a timestamp. The timestamp
- has to have a sufficient range that wrap-arounds are infrequent. It
- is desirable that the range exceeds the maximum expected inactive (e.g.,
- silence) period. Otherwise, if the silence period lasts a full timestamp
- range, the first packet of the next talkspurt would have a timestamp one
- larger than the last packet of the current talkspurt. In that case, the
- new talkspurt could not be readily discerned if the difference in increment
- between timestamps and sequence numbers is used to detect a new talkspurt.
-
- The 10-bit timestamp used by NVP is generally agreed to be too small as it
- wraps around after only 20.5 s (for 20 ms audio packets), while a 32-bit
- timestamp should serve all anticipated needs, even if the timestamp is
- expressed in units of samples or other sub-packet entities.
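-
- Since even a 32-bit timestamp eventually wraps around, receivers have to
- compare timestamps modulo 2^32. The sketch below shows the usual
- serial-number style comparison; this is a common implementation technique,
- not something mandated by the text above.
-
-     /* Sketch: comparing 32-bit timestamps that may have wrapped around,
-      * using modulo-2^32 (serial number) arithmetic. */
-     #include <stdint.h>
-     #include <stdio.h>
-
-     /* Nonzero if timestamp a is newer than b, assuming the true difference
-      * is less than half the timestamp range. */
-     static int ts_newer(uint32_t a, uint32_t b)
-     {
-         return (int32_t)(a - b) > 0;
-     }
-
-     int main(void)
-     {
-         /* a packet just after a wraparound is still recognized as newer */
-         uint32_t before_wrap = 0xfffffff0u;
-         uint32_t after_wrap  = 0x00000010u;
-
-         printf("%d %d\n", ts_newer(after_wrap, before_wrap),
-                ts_newer(before_wrap, after_wrap));    /* prints 1 0 */
-         return 0;
-     }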
-
- A timestamp may be useful not only at the transport, but also at the network
- layer, for example, for scheduling packets based on urgency. The playout
- timestamp would be appropriate for such a scheduling timestamp, as it would
- better reflect urgency than a network-level departure timestamp. Thus, it
- may make sense to use a network-level timestamp such as the one provided by
- ST-II at the transport layer.
-
-
- H. Schulzrinne Expires 03/01/94 [Page 20]
- INTERNET-DRAFT draft-ietf-avt-issues-01.txt October 20, 1993
-
- 3.6.1 Synchronization Methods
-
-
- The necessary header components are determined to some extent by the method
- of synchronizing sender and receivers. In this section, we formally
- describe some of the popular approaches, building on the exposition and
- terminology of Montgomery [11].
-
- We define a number of variables describing the synchronization process. In
- general, the subscript n represents the nth packet in a synchronization
- unit, n = 1, 2, .... Let a_n, d_n, p_n and t_n be the arrival time, variable
- delay, playout time and generation time of the nth packet, respectively.
- Let o denote the fixed delay from sender to receiver. Finally, d_max
- describes the estimated maximum variable delay within the network. The
- estimate is typically chosen in such a way that only a very small fraction
- (on the order of 1%) of packets take more than o + d_max time units. For
- best performance under changing network load conditions, the estimate
- should be refined based on the actual delays experienced. The variable
- delay in a network consists of queueing and media access delays, while
- propagation and processing delays make up the fixed delay. Additional
- end-to-end fixed delay is unavoidably introduced by packetization; the
- non-real-time nature of most operating systems adds a variable delay both
- at the transmitting and receiving end. All variables are expressed in the
- same unit of time, be that seconds or samples, for example. For
- simplicity, we ignore that the
- sender and receiver clocks may not run at exactly the same speed. The
- relationship between the variables is depicted in Fig. 2. The arrows in the
- figure indicate the transmission of the packet across the network, occurring
- after the packetization delay. The packet with sequence number 5 misses the
- playout deadline and, depending on the algorithm used by the receiver, is
- either dropped or treated as the beginning of a new talkspurt.
-
-
- Figure only available in PostScript version of document.
- Figure 2: Playout Synchronization Variables
-
- Given the above definitions, the relationship
-
-                          a_n = t_n + d_n + o                          (1)
-
- holds for every packet. For brevity, we also define l_n as the ``laxity''
- of packet n, i.e., the time p_n - a_n between arrival and playout. Note
- that it may be difficult to measure a_n with resolution below a
- packetization interval, particularly if the measurement is to be in units
- related to the playback process (e.g., samples). All synchronization
- methods differ only in how much they delay the first packet of a
- synchronization unit. All packets within a synchronization unit are played
- out based on the position of the first packet:
-
-                  p_n = p_{n-1} + (t_n - t_{n-1})    for n > 1
-
- Three synchronization methods are of interest. We describe below how they
- compute the playout time for the first packet in a synchronization unit and
-
- H. Schulzrinne Expires 03/01/94 [Page 21]
- INTERNET-DRAFT draft-ietf-avt-issues-01.txt October 20, 1993
-
- what measurement is used to update the delay estimate d_max.
-
- blind delay: This method assumes that the first packet in a talkspurt
- experiences only the fixed delay, so that the full d_max has to be
- added to allow for other packets within the talkspurt experiencing more
- delay:
-
-                          p_1 = a_1 + d_max                             (2)
-
- The estimate for the variable delay is derived from measurements
- of the laxity l_n, so that the new estimate after n packets is
- computed d_{max,n} = f(l_1, ..., l_n), where the function f(.) is a
- suitably chosen smoothing function. Note that blind delay does not
- require timestamps to determine p_1, only an indication of the beginning
- of a synchronization unit. Timestamps may be required to compute p_n,
- however, unless t_n - t_{n-1} is a known constant.
-
- absolute timing: If the packet carries a timestamp measured in time units
- known to the receiver, we can improve our determination of the playout
- point:
-
-                          p_1 = t_1 + o + d_max
-
- This is, clearly, the best that can be accomplished. Here, instead of
- estimating d_max, we estimate o + d_max as some function of p_n - t_n. For
- this computation, it does not matter whether p and t are measured with
- clocks sharing a common starting point.
-
- added variable delay: Each node adds the variable delay experienced within
- it to a delay accumulator within the packet, yielding d_n:
-
-                          p_1 = a_1 - d_1 + d_max
-
- From Eq. 1, it is readily apparent that absolute timing and added
- variable delay yield the same playout time. The estimate for d_max is
- based on the measurements of d. Given a clock with suitably high
- resolution, these estimates can be better than those based on the
- difference between a and p; however, it requires that all routers can
- recognize RTP packets. Also, determining the residence time within a
- router may not be feasible.
-
-
- In summary, absolute timing is to be preferred due to its lower delays
- compared to blind delay, while synchronization using added variable delays
- is currently not feasible within the Internet (it is, however, used for
- G.764).
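-
- The playout rules above translate directly into code. The sketch below
- schedules packets under absolute timing: the first packet of a
- synchronization unit is played at its timestamp plus an offset standing in
- for o + d_max, and later packets keep their timestamp spacing. The text
- leaves the offset estimator open, so the exponentially weighted average of
- the observed delay a_n - t_n and the fixed safety margin used here are
- illustrative choices, not the method prescribed above.
-
-     /* Sketch: playout scheduling with absolute timing.  The offset
-      * estimator (smoothed delay plus a margin) is illustrative only. */
-     #include <stdio.h>
-
-     struct playout_state {
-         double offset;      /* current stand-in for o + d_max */
-         double avg_delay;   /* smoothed observed delay a_n - t_n */
-         double first_play;  /* playout time of first packet in sync unit */
-         double first_ts;    /* timestamp of first packet in sync unit */
-     };
-
-     /* Playout time for a packet with timestamp ts arriving at time a;
-      * sync_start is nonzero for the first packet of a synchronization unit. */
-     static double playout_time(struct playout_state *s, double ts, double a,
-                                int sync_start)
-     {
-         double delay = a - ts;                            /* o + d_n */
-
-         s->avg_delay = 0.9 * s->avg_delay + 0.1 * delay;  /* smoothing */
-         if (sync_start) {
-             s->offset = s->avg_delay + 30.0;              /* margin */
-             s->first_play = ts + s->offset;
-             s->first_ts = ts;
-         }
-         return s->first_play + (ts - s->first_ts);  /* keep relative timing */
-     }
-
-     int main(void)
-     {
-         struct playout_state s = { 550.0, 520.0, 0.0, 0.0 };
-         double ts[] = { 1020, 1040, 1220, 1240, 1260 };   /* illustrative */
-         double a[]  = { 1520, 1530, 1725, 1720, 1791 };
-         int i;
-
-         for (i = 0; i < 5; i++) {
-             int start = (i == 0 || i == 2);      /* talkspurt boundaries */
-             printf("ts %.0f arrives %.0f -> playout %.1f\n", ts[i], a[i],
-                    playout_time(&s, ts[i], a[i], start));
-         }
-         return 0;
-     }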
-
-
- 3.6.2 Detection of Synchronization Units
-
-
- The receiver must have a way of readily detecting the beginning of a
- synchronization unit, as the playout scheduling of the first packet in a
- synchronization unit differs from that in the remainder of the unit. This
-
- H. Schulzrinne Expires 03/01/94 [Page 22]
- INTERNET-DRAFT draft-ietf-avt-issues-01.txt October 20, 1993
-
- detection has to work reliably even with packet reordering; for example,
- reordering at the beginning of a talkspurt is particularly likely since
- common silence detection algorithms send a group of stored packets at the
- beginning of the talkspurt to prevent front clipping.
-
- Two basic methods have been proposed:
-
-
- timestamp and sequence number: The sequence number increases by one with
- each packet transmitted, while the timestamp reflects the total time
- covered, measured in some appropriate unit. A packet is declared to
- start a new synchronization unit if (a) it has the highest timestamp
- and sequence number seen so far (within this wraparound cycle) and
- (b) the difference in timestamp values (converted into a packet count)
- between this and the previous packet is greater than the difference in
- sequence number between those two packets.
-
- This approach has the disadvantage that it may lead to erroneous packet
- scheduling with blind delay if packets are reordered. An example is
- shown in Table 3. In the example, the playout delay is set at 50 time
- units for blind timing and 550 time units for absolute timing. The
- packet intergeneration time is 20 time units.
-
-
-                           blind timing                   absolute timing
-                   no reordering     with reordering
-   seq.  timestamp arrival  playout  arrival  playout     arrival  playout
-   200      1020    1520     1570     1520     1570        1520     1570
-   201      1040    1530     1590     1530     1590        1530     1590
-   202      1220    1720     1770     1725     1750        1725     1770
-   203      1240    1725     1790     1720     1770        1720     1790
-   204      1260    1792     1810     1791     1790        1791     1810
-
-
- Table 3: Example where out-of-order arrival leads to packet loss for blind
- timing
-
- More significantly, detecting synchronization units requires that the
- playout mechanism can translate timestamp differences into packet
- counts, so that it can compare timestamp and sequence number
- differences. If the timespan ``covered'' by a packet changes with
- the encoding or even varies for each packet, this may be cumbersome.
-     NVP provides the timestamp/sequence number combination for detecting
-     talkspurts; a sketch of this detection rule in C follows this list. The
-     following method avoids these drawbacks, at the cost of one additional
-     header bit.
-
- synchronization bit: The beginning of a synchronization unit is indicated
- by setting a synchronization bit within the header. The receiver,
- however, can only use this information if no later packet has already
- been processed. Thus, packet reordering at the beginning of a
- talkspurt leads to missing opportunities for delay adjustment. With
- the synchronization bit, a sequence number is not necessary to detect
- the beginning of a synchronization unit, but a sequence number remains
- useful for detecting packet loss and ordering packets bearing the same
- timestamp. With just a timestamp, it is impossible for the receiver
- to get an accurate count of the number of packets that it should have
- received. While gaps within a talkspurt give some indication of packet
- loss, the receiver cannot tell what part of the tail of a talkspurt
- has been transmitted. (Example: consider the talkspurts with time
- stamps 100, 101, 102, 110, 111. Packets with timestamp 100 and 110
- have the synchronization bit set. The receiver has no way of knowing
- whether it was supposed to have received two talkspurts with a total of
- five packets, or two or more talkspurts with up to 12 packets.) The
- synchronization bit is used by vat, without a sequence number. It is
- also contained in the original version of NVP [12]. A special sequence
- number, as used by G.764, is equivalent.
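-
- The timestamp-and-sequence-number rule of the first method might be coded
- roughly as follows; the sketch assumes 16-bit sequence numbers and a fixed,
- known timestamp increment per packet (here 160 units, e.g., 20 ms of
- 8,000 Hz audio) and ignores timestamp wrap-around for brevity.
-
-   #define SEQ_MOD       65536u
-   #define TS_PER_PACKET 160u     /* assumed timestamp units per packet */
-
-   static unsigned short max_seq;     /* highest sequence number seen   */
-   static unsigned long  max_ts;      /* timestamp of that packet       */
-   static int            have_packet; /* any packet processed yet?      */
-
-   /* Returns 1 if this packet starts a new synchronization unit. */
-   int starts_sync_unit(unsigned short seq, unsigned long ts)
-   {
-       unsigned short seq_ahead;
-       int new_unit = 0;
-
-       if (!have_packet) {            /* very first packet ever seen */
-           have_packet = 1;
-           max_seq = seq;
-           max_ts  = ts;
-           return 1;
-       }
-
-       seq_ahead = (unsigned short)(seq - max_seq);   /* modulo 2^16 */
-
-       /* (a) highest sequence number and timestamp so far, and
-        * (b) the timestamp gap, converted to a packet count, exceeds
-        *     the sequence number gap. */
-       if (seq_ahead > 0 && seq_ahead < SEQ_MOD / 2 && ts > max_ts &&
-           (ts - max_ts) / TS_PER_PACKET > seq_ahead)
-           new_unit = 1;
-
-       if (seq_ahead > 0 && seq_ahead < SEQ_MOD / 2) {  /* newest packet */
-           max_seq = seq;
-           max_ts  = ts;
-       }
-       return new_unit;
-   }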
-
-
- 3.6.3 Interpretation of Synchronization Bit
-
-
- Two possibilities for implementing a synchronization bit are discussed here.
-
-
- start of synchronization unit: The first packet in a synchronization unit
- is marked with a set synchronization bit. With this use of
- the synchronization bit, the receiver detects the beginning of a
- synchronization unit with the following simple algorithm:
-
-
- if synchronization bit = 1
- and current sequence number > maximum sequence number seen so far
- then
- this packet starts a new synchronization unit
-
- if current sequence number > maximum sequence number
- then
- maximum sequence number := current sequence number
- endif
-
-
- Comparisons and arithmetic operations are modulo the sequence number
- range.
-
- end of synchronization unit: The last packet in a synchronization unit is
- marked. As pointed out elsewhere, this information may be useful
- for initiating appropriate fill-in during silence periods and to start
- processing a completed video frame. If a voice silence detector uses
- no hangover, it may have difficulty deciding which is the last packet
- in a talkspurt until it judges the first packet to contain no speech.
- The detection of a new synchronization unit by the receiver is only
- slightly more complicated than with the previous method:
-
- if sync_flag then
- if sequence number >= sync_seq then
- sync_flag := FALSE
- endif
- if sequence number = sync_seq then
- signal beginning of synchronization unit
- endif
- endif
-
- if synchronization bit = 1 then
- sync_seq := sequence number + 1
- sync_flag := TRUE
- endif
-
-
-     By changing the equal sign in the second comparison to 'if sequence
-     number >= sync_seq', a new synchronization unit is detected even if
- packets at the beginning of the synchronization unit are reordered. As
- reordering at the beginning of a synchronization unit is particularly
- likely, for example when transmitting the packets preceding the
- beginning of a talkspurt, this should significantly reduce the number
- of missed talkspurt beginnings.
-
-
- 3.6.4 Interpretation of Timestamp
-
-
- Several proposals as to the interpretation of the timestamp have been
- advanced:
-
-
- packet or frame interval: Each packetization or (video/audio) frame
-     interval increments the timestamp. This approach is very efficient in
-     terms of processing and bit-use, but cannot be used without out-of-band
- information if the time interval of media ``covered'' by a packet
- varies from packet to packet. This occurs for example with
- variable-rate encoders or if the packetization interval is changed
- during a conference. This interpretation of a timestamp is assumed by
- NVP, which defines a frame as a block of PCM samples or a single LPC
- frame. Note that there is no inherent necessity that all participants
- within a conference use the same packetization interval. Local
- implementation considerations such as available clocks may suggest
- different intervals. As another example, consider a conference with
- feedback. For the lecture audio, a long packetization interval may
- be desirable to better amortize packet headers. For side chats,
- delays are more important, thus suggesting a shorter packetization
- interval.(3)
- ------------------------------
- 3. Nevot, for example, allows each participant to have a different
- packetization interval, independent of the packetization interval used by
- Nevot for its outgoing audio. Only the packetization interval for outgoing
- audio for all conferences this Nevot participates in must be the same.
-
- sample: This method simply counts samples, allowing a direct translation
- between time stamp and playout buffer insertion point. It is just
- as easily computable as the per-packet timestamp. However, for some
- media and encodings(4) , it may not be quite clear what a sample is.
- Also, some care must be taken at the receiver and sender if streams use
- different sampling rates. This method is currently used by vat.
-
- Milliseconds: A timestamp incremented every millisecond would wrap around
- once every 49 days. The resolution is sufficient for most
- applications, except that the natural packetization interval for
- LPC-coded speech is 22.5 ms. Also, with a video frame rate of 30 Hz,
- an internal timestamp of higher resolution would need to be truncated
- to millisecond resolution to approximate 33.3 ms intervals. This time
- increment has the advantage of being used by some Unix delay functions,
- which might be useful for playing back video frames with proper timing.
- It might be useful to take the second value from the current system
- clock to allow delay estimates for synchronized clocks.
-
- subset of NTP timestamp: 16 bits encode seconds relative to midnight (0
- hours), January 1, 1900 (modulo 65536) and 16 bits encode fractions of
- a second, with a resolution of approximately 15.2 microseconds, which
- is smaller than any anticipated audio sampling or video frame interval.
- This timestamp is the same as the middle 32 bits of the 64-bit NTP
- timestamp [13]. It wraps around every 18.2 hours. If it should be
- desirable to reconstruct absolute transmission time at the receiver for
- logging or recording purposes, it should be easy to determine the most
- significant 16 bits of the timestamp. Otherwise, wrap-arounds are not
- a significant problem as long as they occur 'naturally', i.e., at a 16
- or 32 bit boundary, so that explicit checking on arithmetic operations
- is not required. Also, since the translation mechanism would probably
- treat the timestamp as a single integer without accounting for its
- division into whole and fractional part, the exact bit allocation
- between seconds and fractions thereof is less important. However,
- the 16/16 approach simplifies extraction from a full NTP timestamp.
- Sixteen bits of fractional seconds also allows a timestamp without
-     wrap-around, i.e., with 32 bits of full seconds encoding time since
-     January 1, 1990, to fit into the 52 bits of an IEEE floating point
- number.
-
- The NTP-like timestamp has the disadvantage that its resolution does
- not map into any of the common sample or packetization intervals.
- Thus, there is a potential uncertainty of one sample at the receiver
- ------------------------------
- 4. Examples include frame-based encodings such as LPC and CELP. Here, given
- that these encodings are based on 8,000 Hz input samples, the preferred
- interpretation would probably be in terms of audio samples, not frames, as
- samples would be used for reconstruction and mixing.
-
-
- as to where to place the beginning of the received packet, resulting
- in the equivalent of a one-sample slip. CCITT recommendation G.821
- postulates a mean slip rate of less than 1 slip in 5 hours, with
- degraded but acceptable service for less than 1 slip in 2 minutes.
- Tests with appropriate rounding conducted by the author showed that
- this uncertainty is not likely to cause problems. In any event, a
- double-precision floating point multiplication is needed to translate
- between this timestamp and the integer sample count available on
- transmission and required for playout.(5)
-
- MPEG timestamps: MPEG uses a 33 bit clock with a resolution of 90 kHz [14]
- as the system clock reference and for presentation time stamps. The
- frequency was chosen based on the divisibility by the nominal video
- picture rates of 24 Hz, 25 Hz, 29.97 Hz and 30 Hz [14, p.42]. The
- frequency would also fit nicely with the 20 ms audio packetization
-     interval. The length of 33 bits is clearly inappropriate, however, for
-     software implementations. 32-bit timestamps still cover more than half
-     a day and thus can be readily extended to full unique timestamps or 33
-     bits if needed.
-
- Microseconds: A 32-bit timestamp incremented every microsecond wraps around
- once every 71.5 minutes. The resolution is high enough that round-off
- errors for video frame intervals and such should be tolerable without
- maintaining a higher-precision internal counter. This resolution is
- also provided, at least nominally, by the Unix gettimeofday() system
- call.
-
- QuickTime: The Apple QuickTime file format is a generalization of the
- previous formats as it combines a 32-bit counter with a 32-bit media
- time scale expressed in time units per second. The four previously
- mentioned timestamps can be represented by time scales of 1000, 65536,
- 90,000 and 1,000,000. For the sample and packet-based case, the value
- would depend on the media content, e.g., 8,000 for standard PCM-coded
- audio.
-
-
- Timestamps based on wallclock time rather than samples or frames have the
- advantage that a receiver does not necessarily need to know about the
- meaning of the encoding contained in the packet in order to process the
- timestamp. For example, a quality-of-service monitor within the network
- could measure delay variance easily, without caring what kind of audio
- information, say, is contained in the packet. Other tools, such as a
- recording and playback tool, can also be written without concern about the
- mapping between timestamp and wallclock units.
- ------------------------------
- 5. The multiplication with an appropriate factor can be approximated
- to the desired precision by an integer multiplication and division, but
- multiplication by a floating point value is actually much faster on some
- modern processors.
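-
- For illustration, the translation between the 16/16 NTP-style timestamp
- (65,536 units per second) and a sample count is a fixed scaling; the sketch
- below assumes 8,000 Hz audio, where the factor 65536/8000 = 8.192 can also
- be applied exactly as an integer multiplication by 1024 and division by 125
- (overflow handling omitted).
-
-   #define TS_UNITS_PER_SEC 65536.0
-   #define SAMPLE_RATE       8000.0   /* assumed audio sampling rate */
-
-   /* Timestamp units corresponding to a given number of samples. */
-   unsigned long samples_to_ts(unsigned long samples)
-   {
-       return (unsigned long)
-           (samples * (TS_UNITS_PER_SEC / SAMPLE_RATE) + 0.5);
-   }
-
-   /* Sample count corresponding to a timestamp difference. */
-   unsigned long ts_to_samples(unsigned long ts_diff)
-   {
-       return (unsigned long)
-           (ts_diff * (SAMPLE_RATE / TS_UNITS_PER_SEC) + 0.5);
-   }
-
-   /* Integer-only alternative: 65536/8000 = 1024/125 exactly. */
-   unsigned long samples_to_ts_int(unsigned long samples)
-   {
-       return samples * 1024ul / 125ul;
-   }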
-
- A time stamp could reflect either real time or sample time. A real time
- timestamp is defined to track wallclock time plus or minus a constant
- offset. Sample time increases by the nominal sampling interval for each
- sample. The two clocks in general do not agree since the clock source used
- for sampling will in all likelihood be slightly off the nominal rate. For
- example, typical crystals without temperature control are only accurate to
- 50 -- 100 ppm (parts per million), yielding a potential drift of 0.36
- seconds per hour between the sampling clock and wallclock time.
-
- It has been suggested to use timestamps relative to the beginning of the
- first transmission from a source. This makes correlation between media
- from different participants difficult and seems to have no technical or
- implementation advantages, except for avoiding wrap-around during most
- conferences. As pointed out above, that seems to be of little benefit.
- Clearly, the reliability of wallclock-synchronized timestamps depends on
- how closely the system clocks are synchronized, but that does not argue for
- giving up potential real-time synchronization in all cases.
-
- Using real time rather than sample time allows for easier synchronization
- between different media and users (e.g., during playback of a recorded
- conference) and makes it possible to compensate for slow or fast sample
- clocks. Note that it
- is neither desirable nor necessary to obtain the wall clock time when each
- packet was sampled. Rather, the sender determines the wallclock time at the
- beginning of each synchronization unit (e.g., a talkspurt for voice and a
- frame for video) and adds the nominal sample clock duration for all packets
- within the talkspurt to arrive at the timestamp value carried in packets.
- The real time at the beginning of a talkspurt is determined by estimating
- the true sample rate for the duration of the conference.
-
- The sample rate estimate has to be accurate enough to allow placing the
- beginning of a talkspurt, say, to within at most 50 to 100 ms, otherwise the
- lack of synchronization may be noticeable, delay computations are confused
- and successive talkspurts may be concatenated.
-
- Estimating the true sampling instant to within a few milliseconds is
- surprisingly difficult for current operating systems. The sample rate r can
- be estimated as
-
-     r = (s + q) / (t - t_0)
-
- Here, t is the current time, t_0 the time at which the first sample was
- acquired, s is the number of samples read, and q is the number of samples
- ready to be read (queued) at time t. Let p denote the number of samples
- in a packet. The timestamp in the synchronization packet reflects the
- sampling instant of the first sample of that packet and is computed as
- t - (p + q)/r. Unfortunately, only s and p are known precisely. The
- accuracy of the estimates for t and t_0 depends on how accurately the
- beginning of
- sampling and the last reading from the audio device can be measured. There
- is a non-zero probability that the process will get preempted between the
- time the audio data is read and the instant the system clock is sampled.
- It remains unclear whether indications of current buffer occupancy, if
- available, can be trusted. Even with increasing sample count, the absolute
- accuracy of the timestamp is roughly the same as the measurement accuracy of
- t, as differentiating with respect to t shows. Experiments with the SunOS
- audio driver showed significant variations of the estimated sample rate,
- with discontinuities of the computed timestamps of up to 25 ms. Kernel
- support is probably required for meaningful real time measurements.
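-
- A rough sketch of the computation just described, using the BSD
- gettimeofday() call; the audio-device query is hypothetical, and the
- division assumes that at least one packetization interval has elapsed since
- sampling started.
-
-   #include <sys/time.h>
-
-   extern long audio_samples_queued(void);  /* hypothetical driver query */
-
-   static double t0;            /* wallclock time when sampling started  */
-   static long   samples_read;  /* s: samples read from the device       */
-
-   static double now(void)
-   {
-       struct timeval tv;
-       gettimeofday(&tv, (struct timezone *)0);
-       return tv.tv_sec + tv.tv_usec / 1e6;
-   }
-
-   void sampling_started(void)  /* call when the first sample is acquired */
-   {
-       t0 = now();
-       samples_read = 0;
-   }
-
-   /* Estimated wallclock time (in seconds) at which the first sample of
-    * a packet of p samples was taken: t - (p + q)/r with
-    * r = (s + q)/(t - t0). */
-   double packet_sampling_instant(long p)
-   {
-       double t = now();
-       long   q = audio_samples_queued();   /* queued, unread samples     */
-       double r;
-
-       samples_read += p;                   /* s now includes this packet */
-       r = (samples_read + q) / (t - t0);   /* estimated sample rate      */
-       return t - (p + q) / r;
-   }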
-
- Sample time increments with the sampling interval for every sample or
- (sub)frame received from the audio or video hardware. It is easy to
- determine, as long as care is taken to avoid cumulative round-off errors
- incurred by simply repeatedly adding the approximate packetization interval.
- However, synchronization between media and end-to-end delay measurements are
- then no longer feasible. (Example: Consider an audio and a video stream.
- If the audio sample clock is slightly faster than the real clock and the
- video sampling clock, a video and audio frame belonging together would be
- marked by different timestamps, thus played out at different instants.)
-
- If we choose to use sample time, the advantage of using an NTP-format
- timestamp disappears, as the receiver can easily reconstruct an NTP
- sample-based timestamp from the sample count if needed, but would not have
- to if no cross-media synchronization is required. RTCP could relate the
- time increment per sample in full precision. The definition of a ``sample''
- will depend on the particular medium, and could be an audio sample, a video
- frame or a voice frame (as produced by a non-waveform coder). The mapping
- fails if there is no time-invariant mapping between sample units and time.
-
- It should be noted that it may not be possible to associate a meaningful
- notion of time with every packet. For example, if a video frame is
- broken into several fragments, there is no natural timestamp associated
- with anything but the first fragment, particularly if there is not even
- a sequential mapping from screen scan location into packets. Thus, any
- timestamp used would be purely artificial. A synchronization bit could be
- used in this particular case to mark the beginning of synchronization units.
- For packets within synchronization units, there are two possible approaches:
- first, we can introduce an auxiliary sequence number that is only used to
- order packets within a frame. Secondly, we could abuse the timestamp field
- by incrementing it by a single unit for each packet within the frame, thus
- allowing a variable number of packets per frame. The latter approach is
- barely workable and rather kludgy.
-
-
- 3.6.5 End-of-talkspurt indication
-
-
- An end-of-talkspurt indication is useful to distinguish silence from lost
- packets. The receiver would want to replace silence by an appropriate
- background noise level to avoid the ``noise-pumping'' associated with
- silence detection. On the other hand, missing packets should be
- reconstructed from previous packets. If the silence detector makes use
- of hangover, the transmitter can easily set the end-of-talkspurt indicator
- in the last hangover packet. If talkspurts follow back-to-back, the
- end-of-talkspurt indicator has no effect except in the
- case where the first packet of a talkspurt is lost. In that case, the
- indicator would erroneously trigger noise fill instead of loss recovery.
- The end-of-talkspurt indicator is implemented in G.764 as a ``more'' bit
- which is set to one for all but the last packet within a talkspurt.
-
-
- 3.6.6 Recommendation
-
-
- Given the ease of cross-media synchronization and the media independence,
- the use of 32-bit 16/16 timestamps representing the middle part of the NTP
- timestamp is suggested. Generally, a wallclock-based timestamp appears
- to be preferable to a sample-based one, but it may only be approximately
- realizable for some current operating systems. Inter-media synchronization
- to below 10 to 20 ms has to await mechanisms that can accurately determine
- when a particular sample was actually received by the A/D converter.
- Particularly with sample- or wallclock-based timestamps, a synchronization
- bit simplifies the detection of the beginning of a synchronization unit.
- Indicating either the end or beginning of a synchronization unit is roughly
- equivalent, with tradeoffs between the two.
-
-
- 3.7 Segmentation and Reassembly
-
-
- For high-bandwidth video, a single frame may not fit into the maximum
- transmission unit (MTU). Thus, some form of frame sequence number is needed.
- If possible, the same sequence number should be used for synchronization and
- fragmentation. Six possibilities suggest themselves:
-
-
- overload the timestamp: No sequence number is used. Within a frame, the
- timestamp has no meaning. Since it is used for synchronization only
- when the synchronization bit is set, the other timestamps can just
- increase by one for each packet. However, as soon as the first
- frame gets lost or reordered, determining positions and timing becomes
- difficult or impossible.
-
- packet count: The sequence number is incremented for every packet, without
- regard to frame boundaries. If a frame consists of a variable number
- of packets, it may not be clear what position the packet occupies
- within the frame if packets are lost or reordered. Continuous sequence
- numbers make it possible to determine if all packets for a particular
- frame have arrived, but only after the first packet of the next frame,
- distinguished by a new timestamp, has arrived.
-
- packet count within a frame: The sequence number is reset to zero at the
- beginning of each frame. This approach has properties complementary to
- continuous sequence numbers.
-
- packet count and first-packet sequence number: Packets use a continuously
- incrementing sequence number plus an option field in every packet
- indicating the initial sequence number within the playout unit(6) .
- Carrying both a continuous and packet-within-frame count achieves the
- same effect.
-
- packet count with last-packet sequence number: Packets carry a continuous
- sequence number plus an option in every packet indicating the last
- sequence number within the playout unit. This has the advantage that
- the receiver can readily detect when the last packet for a playout unit
- has been received. The transmitter may not know, however, at the
- beginning of a playout unit how many packets it will comprise. Also,
- the position within the playout unit is more difficult to determine if
- the initial packet and the previous frame is lost.
-
- packet count and frame count: The sequence number counts packets, without
- regard to frame boundaries. A separate counter increments with each
- frame. Detecting the end of a frame is delayed until the first packet
- belonging to the next frame. Also, the frame count cannot help to
-     determine the position of the packet within a frame.
-
-
- It could be argued that encoding-specific location information should be
- contained within the media part, as it will likely vary in format and use
- from one medium to the next. Thus, frame count, the sequence number of the
- last or first packet in a frame, etc., belong in the media-specific header.
-
- The size of the sequence number field should be large enough to allow
- unambiguous counting of expected vs. received packets. A 16-bit sequence
- number would wrap around every 20 minutes for a 20 ms packetization
- interval. Using 16 bits may also simplify modulo arithmetic.
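-
- To illustrate the bookkeeping implied by combining a continuous packet
- count with a first-packet sequence number carried in every packet, consider
- the following sketch; the option is hypothetical and the 32-fragment limit
- is arbitrary.
-
-   /* Position of a fragment within its frame (0 = first fragment);
-    * modulo-2^16 arithmetic handles sequence number wrap-around. */
-   unsigned short fragment_position(unsigned short seq,
-                                    unsigned short first_seq)
-   {
-       return (unsigned short)(seq - first_seq);
-   }
-
-   /* With continuous sequence numbers, the number of fragments n in a
-    * frame becomes known only once the first packet of the next frame
-    * arrives; the frame is complete when fragments 0..n-1 have all been
-    * seen (received_mask has bit i set for position i). */
-   int frame_complete(unsigned long received_mask,
-                      unsigned short this_first_seq,
-                      unsigned short next_first_seq)
-   {
-       unsigned short n = (unsigned short)(next_first_seq - this_first_seq);
-
-       if (n == 0 || n > 32)          /* sketch limited to 32 fragments */
-           return 0;
-       return received_mask == (n == 32 ? 0xffffffffUL : (1UL << n) - 1);
-   }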
-
-
- 3.8 Source Identification
-
-
- 3.8.1 Bridges, Translators and End Systems
-
-
- It is necessary to be able to identify the origin of the real-time data in
- terms meaningful to the application. First, this is required to demultiplex
- sites (or sources) within the same conference. Secondly, it allows an
- indication of the currently active source.
-
- Currently, NVP makes no explicit provisions for this, assuming that the
- network source address can be used. This may fail if intermediate agents
- intervene between the content source and final destination. Consider the
- example in Fig. 3. An RTP-level bridge is defined as an entity that
- ------------------------------
- 6. suggested by Steve Casner
-
- transforms either the RTP header or the RTP media data or both. Such
- a bridge could for example merge two successive packets for increased
- transport efficiency or, probably the most common case, translate media
- encodings for each stream, say from PCM to LPC (called transcoding).
- A synchronizing bridge is defined here as a bridge that recreates a
- synchronous media stream, possibly after mixing several sources. An
- application that mixes all incoming streams for a particular conference,
- recreates a synchronous audio stream and then forwards it to a set of
- receivers is an example of a synchronizing bridge. A synchronizing bridge
- could be built from two end system applications, with the first application
- feeding the media output to the media input of the second application and
- vice versa.
-
- In figure 3, the bridges are used to translate audio encodings, from PCM
- and ADPCM to LPC. The bridge could be either synchronizing or not. Note
- that a resynchronizing bridge is only necessary if audio packets depend on
- their predecessors and thus cannot be transcoded independently. It may be
- advantageous if the packetization interval can be increased. Also, for low
- speed links that are barely able to handle one active source at a time,
- mixing at the bridge avoids excessive queueing delays when several sources
- are active at the same time. A synchronizing bridge has the disadvantage
- that it always increases the end-to-end delay.
-
- We define translators as transport-level entities that translate between
- transport protocols, but leave the RTP protocol unit untouched. In the
- figure, the translator connects a multicast group to a group of hosts that
- are not multicast capable by performing transport-level replication.
-
- We define an end system as an entity that receives and generates media
- content, but does not forward it.
-
- We define three types of sources: the content source is the actual origin
- of the media, e.g., the talker in an audiocast; a synchronization source
- is the combination of several content sources with its own timing; the
- network source is the network-level origin as seen by the end system
- receiving the media.
-
- The end system has to synchronize its playout with the synchronization
- source, indicate the active party according to the content source and return
- media to the network source. If an end system receives media through a
- resynchronizing bridge, the end system will see the bridge as the network
- and synchronization source, but the content sources should not be affected.
- The translator does not affect the media or synchronization sources, but the
- translator becomes the network source. (Note that having the translator
- change the IP source address is not possible since the end systems need
- to be able to return their media to the translator.) In the (common)
- case where no bridge or translator intercepts packets between sender and
- receiver, content, synchronization and network source are identical. If
- there are several bridges or translators between sender and receiver, only
- the last one is visible to the receiver.
-
-
-     /---------\   ADPCM   +------+
-    |   group   |<-------->|  GW  |---\  LPC
-     \---------/           +------+    \          /------ end system
-                                        \        /
-                             reflector   >--------------- end system
-                                        /        \
-     /---------\    PCM    +------+    /          \------ end system
-    |   group   |<-------->|  GW  |---/  LPC
-     \---------/           +------+
-
-                           <---> multicast
-
-                      Figure 3: Bridge topology
-
- vat audio packets include a variable-length list of at most 64 4-byte
- identifiers containing all content sources of the packet. However, there is
- no convenient way to distinguish the synchronization source from the network
- source. The end system needs to be able to distinguish synchronization
- sources because jitter computation and playout delay differ for each
- synchronization source.
-
-
- 3.8.2 Address Format Issues
-
-
- The limitation to four bytes of addressing information may not be desirable
- for a number of reasons. Currently, it is used to hold an IP address. This
- works as long as four bytes are sufficient to hold an identifier that is
- unique throughout the conference and as long as there is only one media
- source per IP address. The latter assumption tends to be true for many
- current workstations, but it is easy to imagine scenarios where it might not
- be, e.g., a system could hold a number of audio cards, could have several
- audio channels (Silicon Graphics systems, for example) or could serve as a
- multi-line telephone interface.(7)
-
- The combination of IP address and source port can identify multiple sources
- per site if each content source uses a different source port. For a small
- number of sources, it appears feasible, if inelegant, to allocate ports just
- to distinguish sources. In the PBX example a single output port would
- appear to be the appropriate method for sending all incoming calls across
- the network. The mechanisms for allocating unique file names could also be
- used. The difficult part will be to convince all applications to draw from
- ------------------------------
- 7. If we are willing to forego the identification with a site, we could
- have a multiple-audio channel site pick unused IP addresses from the local
- network and associate them with the second and following audio ports.
-
- the same numbering space.
-
- For efficiency in the common case of one source per workstation, the
- convention (used in vat) of using the network source address, possibly
- combined with the user id or source port, as media and synchronization
- source should be maintained.
-
- There are several possible approaches to naming sources. We compare here
- two examples representing naming through globally unique network addresses
- and through a concatenation of locally unique identifiers.
-
- The receiver needs to be able to uniquely identify the content source so
- that speaker indication and labeling work. For playout synchronization, the
- synchronization source needs to be determined. The identification mechanism
- has to continue to work even if the path between sender and receiver
- contains multiple bridges and translators.
-
- Also, in the common case of no bridges or translators, the only information
- available at the receiver is the network address and source port. This
- can cause difficulties if there is more than one participant per host in a
- certain conference. If this can occur, it is necessary that the application
- opens two sockets, one for listening bound to the conference port number and
- one for sending, bound to some locally unique port. That randomly chosen
- port should also be used for reverse application data, i.e., requests from
- the receiver back to the content source. Only the listening socket needs
- to be a member of the IP multicast group. If an application multiplexes
- several locally generated sources, e.g., an interface to an audio bridge,
- it should follow the rules for bridges, that is, insert content source
- information.
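-
- A minimal sketch of this two-socket arrangement using the BSD socket
- interface; the group address and conference port are placeholders, and
- error checking is omitted.
-
-   #include <sys/types.h>
-   #include <sys/socket.h>
-   #include <netinet/in.h>
-   #include <arpa/inet.h>
-   #include <string.h>
-
-   void open_conference_sockets(const char *group, unsigned short conf_port,
-                                int *recv_fd, int *send_fd)
-   {
-       struct sockaddr_in addr;
-       struct ip_mreq mreq;
-
-       /* Receiving socket, bound to the well-known conference port. */
-       *recv_fd = socket(AF_INET, SOCK_DGRAM, 0);
-       memset(&addr, 0, sizeof(addr));
-       addr.sin_family = AF_INET;
-       addr.sin_addr.s_addr = htonl(INADDR_ANY);
-       addr.sin_port = htons(conf_port);
-       bind(*recv_fd, (struct sockaddr *)&addr, sizeof(addr));
-
-       /* Only the listening socket joins the multicast group. */
-       mreq.imr_multiaddr.s_addr = inet_addr(group);
-       mreq.imr_interface.s_addr = htonl(INADDR_ANY);
-       setsockopt(*recv_fd, IPPROTO_IP, IP_ADD_MEMBERSHIP,
-                  (char *)&mreq, sizeof(mreq));
-
-       /* Sending socket, bound to a locally unique (ephemeral) port that
-        * also distinguishes this source from others on the same host. */
-       *send_fd = socket(AF_INET, SOCK_DGRAM, 0);
-       memset(&addr, 0, sizeof(addr));
-       addr.sin_family = AF_INET;
-       addr.sin_addr.s_addr = htonl(INADDR_ANY);
-       addr.sin_port = htons(0);            /* system picks a free port */
-       bind(*send_fd, (struct sockaddr *)&addr, sizeof(addr));
-   }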
-
-
- 3.8.3 Globally unique identifiers
-
-
- Sources are identified by their network address and the source port number.
- The source port number rather than some other integer has to be chosen for
- the common case that RTP packets contain no SSRC or CSRC options. Since
- the SDES option contains an address, it has to be the network address
- plus source port, no other information being available to the receiver
- for matching. (The SDES address is not strictly needed unless a bridge
- with mixing is involved, but carrying it keeps the receiver from having
- to distinguish those cases.) Since tying a protocol too closely to one
- particular network protocol is considered a bad idea (witness the difficulty
- of adopting parts of FTP for non-IP protocols), the address should probably
- have the form of a type-length-value field. To avoid having to manage yet
- another name space, it appears possible to re-use the Ethertype values, as
- all commonly used protocols with their own address space appear to have been
- assigned such a value. Other alternatives, such as using the BSD Unix
- AF constants, suffer from the drawback that there does not appear to be a
- universally agreed-upon numbering. NSAPs can contain other addresses, but
- not every address format (such as IP) has an NSAP representation. The
- receiver application does not need to interpret the addresses themselves; it
- treats address format identifier (e.g., the Ethertype field) and address as
- a globally unique byte string. We have to assure a single host does not use
- two network addresses, one for transmission and a different one in the SDES
- option.
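-
- One possible layout of such a type-length-value source address is sketched
- below; the field sizes are illustrative only, with the Ethertype for IP
- (0x0800) as an example type value.
-
-   /* Sketch of a type-length-value source address: the type reuses the
-    * Ethertype numbering (0x0800 = IP), the length counts the value
-    * bytes, and the value holds the address plus source port. */
-   struct tlv_address {
-       unsigned short type;       /* address family, as an Ethertype   */
-       unsigned char  length;     /* number of value bytes that follow */
-       unsigned char  value[1];   /* address + port, format-specific   */
-   };
-
-   /* The receiver treats (type, value) as an opaque, globally unique
-    * byte string; equality comparison is all that is required. */
-   int tlv_address_equal(const struct tlv_address *a,
-                         const struct tlv_address *b)
-   {
-       int i;
-
-       if (a->type != b->type || a->length != b->length)
-           return 0;
-       for (i = 0; i < a->length; i++)
-           if (a->value[i] != b->value[i])
-               return 0;
-       return 1;
-   }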
-
- The rules for adding CSRC and SSRC options are simple:
-
-
- end system: End systems do not insert CSRC or SSRC options. The receiver
- remembers the CSRC address for each site; if none is explicitly
- specified, the SSRC address is used. If that is also missing, the
- network address is used. SDES options are matched to this content
- source address.
-
- bridge: A bridge adds the network source address of all sources
- contributing to a particular outgoing packet as CSRC options. A bridge
- that receives a packet containing CSRC options may decide to copy those
- CSRC options into an outgoing packet that contains data from that
- bridge.
-
- translator: The translator checks whether the packet already contains a
- SSRC (inserted by an earlier translator). If so, no action is
- required. Otherwise, the translator inserts an SSRC containing the
- network address of the host from which the packet was received.
-
-
- The SSRC option is set only by the translator, unless the packet already
- bears such an option.
-
- Globally unique identifiers based on network addresses have the advantage
- that they simplify debugging, for example by making it possible to
- determine which bridge processed a message, even after the packet has
- passed through a
- translator.
-
-
- 3.8.4 Locally unique addresses
-
-
- In this scheme, the SSRC, CSRC and SDES options contain locally unique
- identifiers of some length. For lengths of at least four bytes, it
- is sufficient to have the application pick one at random, without local
- coordination, with sufficiently low probability of collision within a single
- host. The receiver creates a globally unique identifier by concatenating
- the network address and one or more random identifiers. The synchronization
- source is identified by the concatenation of the SSRC identifier and the
- network address. Only translators are allowed to set the SSRC option. If a
- translator receives an RTP packet which already contains an SSRC option, as
- can occur if a packet traverses several translators, the translator has to
- choose a new set of values, mapping packets with the same network source,
- but different incoming SSRC value into different outgoing SSRC values. Note
- that the SSRC values constitute a label-swapping scheme similar to that
- used for ATM networks, except that the association setup is implicit. If a
- translator loses state (say, after rebooting), the mapping is simply
- reestablished as packets arrive from end systems or other translators.
- Until the receivers time out, a single source may appear twice and there
- may be a temporary confusion of sources and their descriptors.
-
- The rules are:
-
-
- end system: An end system never inserts CSRC options and typically does not
- insert an SSRC option. An end system application may insert an SSRC
- option if it originates more than one stream for a single conference
- through a single network and transport address, e.g., a single UDP
-     port. The SDES option contains a zero for the identifier, indicating
-     that the receiver is to match on the network address only. The receiver
-     determines the synchronization source as the concatenation of the
-     network source and the SSRC identifier.
-
- bridge: A bridge assigns each source its own CSRC identifier (non-zero),
- which is then used also in the SDES option.
-
- translator: The translator maintains a list of all incoming sources, with
- their network and SSRC, if present. Sources without SSRC are assigned
- an SSRC equal to zero. Each of these sources is assigned a new local
- identifier, which is then inserted into the SSRC option.
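-
- A sketch of the label-swapping table such a translator would maintain;
- names are hypothetical and the lookup is a simple linear scan.
-
-   #define MAX_SOURCES 64          /* illustrative table size */
-
-   struct ssrc_map_entry {
-       unsigned long net_src;      /* incoming network source address */
-       unsigned long ssrc_in;      /* incoming SSRC (0 if none)       */
-       unsigned long ssrc_out;     /* locally unique outgoing SSRC    */
-       int           in_use;
-   };
-
-   static struct ssrc_map_entry map[MAX_SOURCES];
-   static unsigned long next_ssrc = 1;
-
-   /* Return the outgoing SSRC for a packet, creating a new mapping if
-    * the (network source, incoming SSRC) pair has not been seen. */
-   unsigned long translate_ssrc(unsigned long net_src, unsigned long ssrc_in)
-   {
-       int i, free_slot = -1;
-
-       for (i = 0; i < MAX_SOURCES; i++) {
-           if (map[i].in_use && map[i].net_src == net_src &&
-               map[i].ssrc_in == ssrc_in)
-               return map[i].ssrc_out;
-           if (!map[i].in_use && free_slot < 0)
-               free_slot = i;
-       }
-       if (free_slot < 0)
-           return 0;           /* table full; real code would age entries */
-
-       map[free_slot].in_use   = 1;
-       map[free_slot].net_src  = net_src;
-       map[free_slot].ssrc_in  = ssrc_in;
-       map[free_slot].ssrc_out = next_ssrc++;
-       return map[free_slot].ssrc_out;
-   }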
-
-
- Local identifiers have advantages: the identifiers within the packet are
- significantly shorter (four to six vs. at least ten bytes with padding),
- and comparisons of content and synchronization sources are quicker
- (integer comparison vs. variable-length string comparison).
- The identifiers are meaningless for debugging. In particular, it is
- not easy for the receiver sitting behind a translator and a bridge to
- determine where a bridge is located, unless the bridge identifies itself
- periodically, possibly with another SDES-like option containing the actual
- network address.
-
- The major drawback appears to be the additional translator complexity:
- translators need to maintain a mapping from incoming network/SSRC pairs to
- outgoing SSRC values.
-
- Note that using IP addresses as ``random'' local identifiers is not workable
- if there is any possibility that two sources participating in the same
- conference can coexist on the same host.
-
- A somewhat contrived scenario is shown in Fig. 4.
-
-
- Figure only available in PostScript version.
- Figure 4: Complicated topology with translators (R) and bridges (G)
-
- 3.9 Energy Indication
-
-
- G.764 contains a 4-bit noise energy field, which encodes the white noise
- energy to be played by the receiver in the silences between talkspurts.
- Playing silence periods as white noise reduces the noise-pumping where the
- background noise audible during the talkspurt is audibly absent at the
- receiver during silence periods. Substituting white noise for silence
- periods at the receiver is not recommended for multi-party conferences, as
- the summed background noise from all silent parties would be distracting.
- Determining the proper noise level appears to be difficult. It is suggested
- that the receiver simply takes the energy of the last packet received before
- the beginning of a silence period as an indication of the background noise.
- With this mechanism, an explicit indication in the packet header is not
- required.
-
-
- 3.10 Error Control
-
-
- In principle, the receiver has four choices in handling packets with bit
- errors [15]:
-
-
- no checking: the receiver provides no indication whether a data packet
- contains bit errors, either because a checksum is not present or is not
- checked.
-
- discard: the receiver discards errored packets, with no indication to the
- application.
-
- receive: the receiver delivers and flags errored packets to the
- application.
-
- correct: the receiver drops errored packets and requests retransmission.
-
-
- It remains to be decided whether the header, the whole packet or neither
- should be protected by checksums. NVP protects its header only, while G.764
- has a single 16-bit check sequence covering both datalink and packet voice
- header. However, if UDP is used as the transport protocol, a checksum over
- the whole packet is already computed by the receiver. (Checksumming for UDP
- can typically be disabled by the sending or receiving host, but usually not
- on a per-port basis.) ST-II does not compute checksums for its payload.
- Many data link protocols already discard packets with bit errors, so that
- packets are rarely rejected due to higher-layer checksums.
-
- Bit errors within the data part may be easier to tolerate than a lost
- packet, particularly since some media encoding formats may provide built-in
- error correction. The impact of bit errors within the header can vary; for
- example, errors within the timestamp may cause the audio packet to be played
- out at the wrong time, probably much more noticeable than discarding the
- packet. Other noticeable effects are caused by a wrong flow or encoding
- identifier. If a separate checksum is desired for the cases where the
- underlying protocols do not already provide one, it should be optional.
- Once optional, it would be easy to define several checksum options, covering
- just the header, the header plus a certain part of the body or the whole
- packet.
-
- A checksum can also be used to detect whether the receiver has the correct
- decryption key, avoiding noise or (worse) denial-of-service attacks. For
- that application, the checksum should be computed across the whole packet,
- before encrypting the content. Alternatively, a well-known signature could
- be added to the packet and included in the encryption, as long as known
- plaintext does not weaken the encryption security.
-
- Embedding a checksum as an option may lead to undiscovered errors if
- the presence of the checksum is masked by errors. This can occur
- in a number of ways, for example by an altered option type field, a
- final-option bit erroneously set in options prior to the checksum option or
- an erroneous field length field. Thus, it may be preferable to prefix
- the RTP packet with a checksum as part of the specification of running
- RTP over some network or transport protocol. To avoid the overhead of
- including a checksum even in the common case where it is not needed, it
- might be appropriate to distinguish two RTP protocol variations through the
- next-protocol value in the lower-layer protocol header; the first would
- include a checksum, the second would not. The checksum itself offers a
- number of encoding possibilities(8) :
-
-
- o have two 16-bit checksums, one covering the header, the other the data
- part
-
- o combine a 16-bit checksum with a byte count indicating its coverage,
- thus allowing either a header-only or a header-plus-data checksum
-
-
- The latter has the advantage that the checksum can be computed without
- determining the header length.
-
- The error detection performance and computational cost of some common 16-bit
- checksumming algorithms are summarized in Table 4. The implementations were
- drawn from [16] and compiled on a SPARC IPX using the Sun ANSI C compiler
- ------------------------------
- 8. suggested by S. Casner
-
- with optimization. The checksum computation was repeated 100 times;
- thus, due to data cache effects, the execution times shown are probably
- better than would be measured in an actual application. The relative
- performance, however, should be similar. Among the algorithms, the CRC has
- the strongest error detection properties, particularly for burst errors,
- while the remaining algorithms are roughly equivalent [16]. The Fletcher
- algorithm with modulo 255 (shown here) has the peculiar property that a
- transformation of a byte from 0 to 255 remains undetected. CRC, the IP
- checksum and Fletcher's algorithm cannot detect spurious zeroes at the end
- of a variable-length message [17]. The non-CRC checksums have the advantage
- that they can be updated incrementally if only a few bytes have changed.
- The latter property is important for translators that insert synchronization
- source indicators.
-
-     algorithm                                      ms
-     IP checksum                                 0.093
-     Fletcher's algorithm, optimized [17]        0.192
-     CRC CCITT                                   0.310
-     Fletcher's algorithm, non-optimized [18]    2.044
-
-
- Table 4: Execution time of common 16-bit checksumming algorithms, for a
- 1024-byte packet, in milliseconds
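-
- For reference, the IP checksum in the table is the 16-bit one's-complement
- sum used by the Internet protocols; a straightforward (unoptimized) version
- in the style of RFC 1071 is sketched below. Incremental update, as needed
- by a translator that rewrites a few header bytes, amounts to subtracting
- the old 16-bit words and adding the new ones.
-
-   /* 16-bit one's-complement (Internet) checksum over a buffer, with
-    * words taken in network byte order. */
-   unsigned short ip_checksum(const unsigned char *buf, int len)
-   {
-       unsigned long sum = 0;
-
-       while (len > 1) {                    /* sum 16-bit words */
-           sum += ((unsigned long)buf[0] << 8) | buf[1];
-           buf += 2;
-           len -= 2;
-       }
-       if (len == 1)                        /* odd trailing byte */
-           sum += (unsigned long)buf[0] << 8;
-
-       while (sum >> 16)                    /* fold the carries */
-           sum = (sum & 0xffff) + (sum >> 16);
-
-       return (unsigned short)~sum;
-   }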
-
-
- 3.11 Security and Privacy
-
-
- 3.11.1 Introduction
-
-
- The discussions in this section are based on the work of the privacy
- enhanced mail (PEM) working group within the Internet Engineering Task
- Force, as documented in [19,20] and related documents. The reader is
- referred to RFC 1113 [19] or its successors for terminology. Also relevant
- is the work on security for SNMP Version 2. We discuss here how the
- following security-related services may be implemented for packet voice and
- video:
-
-
- Confidentiality: Measures that ensure that only the intended receiver(s)
- can decode the received audio/video data; for others, the data contains
- no useful information.
-
- Authentication: Measures that allow the receiver(s) to ascertain the
- identity of the sender of data or to verify that the claimed originator
- is indeed the originator of the data.
-
- Message integrity: Measures that allow the receiver(s) to detect whether
- the received data has been altered.
-
- As for PEM [19], the following privacy-related concerns are not addressed at
- this time:
-
-
- o access control
-
- o traffic flow confidentiality
-
- o routing control
-
- o assurance of data receipt and non-deniability of receipt
-
- o duplicate detection, replay prevention, or other stream-oriented
- services
-
-
- These services either require connection-oriented services or support from
- the lower layers that is currently unavailable. A reasonable goal is to
- provide privacy at least equivalent to that provided by the public telephone
- system (except for traffic flow confidentiality).
-
- As for privacy-enhanced mail, the sender determines which privacy
- enhancements are to be performed for a particular part of a data
- transmission. Therefore, mechanisms should be provided that allow the
- sender to determine whether the desired recipients are equipped to process
- any privacy-enhancements. This is functionally similar to the negotiation
- of, say, media encodings and should probably be handled by similar
- mechanisms. It is anticipated that privacy-enhanced mail will be used
- in the absence of or in addition to session establishment protocols and
- agents to distribute keys or negotiate the enhancements to be used during a
- conference.
-
-
- 3.11.2 Confidentiality
-
-
- Only data encryption can provide confidentiality as long as intruders can
- monitor the channel. It is desirable to specify an encryption algorithm and
- provide implementations without export restrictions. Although DES is widely
- available outside the United States, its use within software in both source
- and binary form remains difficult.
-
- We have the choice of either encrypting and/or authenticating the whole
- packet or only the options and payload. Encrypting the fixed header denies
- the intruder knowledge about some conference details (such as timing and
- format) and protects against replay attacks. Encrypting the fixed header
- also allows some heuristic detection of key mismatches, as the version
- identifier, timestamp and other header information are somewhat predictable.
- However, header encryption makes packet traces and debugging by external
- programs difficult. Also, since translators may need to inspect and modify
- the header, but do not have access to the sender's key, at least part of
- the header needs to remain unencrypted, with the ability for the receiver
- to discern which part has been encrypted. Given these complications and
- the uncertain benefits of header encryption, it appears appropriate to limit
- encryption to the options and payload part only.
-
- In public key cryptography, the sender uses the receiver's public key for
- encryption. Public key cryptography does not work for true multicast
- systems since the public encoding key for every recipient differs, but it
- may be appropriate when used in two-party conversations or application-level
- multicast. In that case, mechanisms similar to privacy enhanced mail will
- probably be appropriate. Key distribution for symmetric-key encryption such
- as DES is beyond the scope of this recommendation, but the services of
- privacy enhanced mail [19,21] may be appropriate.
-
- For one-way applications, it may be desirable to prohibit listeners from
- interrupting the broadcast. (After all, since live lectures on campus
- get disrupted fairly often, there is reason to fear that a sufficiently
- controversial lecture carried on the Internet could suffer a similar fate.)
- Again, asymmetric encryption can be used. Here, the decryption key is
- made available to all receivers, while the encryption key is known only
- to the legitimate sender. Current public-key algorithms are probably too
- computationally intensive for all but low-bit-rate voice. In most cases,
- filtering based on sources will be sufficient.
-
-
- 3.11.3 Message Integrity and Authentication
-
-
- The usual message digest methods are applicable if only the integrity of the
- message is to be protected against tampering. Again, services similar to
- that of privacy-enhanced mail [22] may be appropriate. The MD5 message
- digest [23] appears suitable. It translates any size message into a 128-bit
- (16-byte) signature. On a SPARCstation IPX (Sun 4/50), the computation
- of a signature for a 180-byte audio packet takes approximately 0.378 ms.(9)
- Defining the signature to apply to all data beginning at the signature
- option allows operation when translators change headers. The receiver has
- to be able to locate the public key of the claimed sender. This poses two
- problems: first, a way of identifying the sender unambiguously needs to be
- found. The current methods of identification, such as the SMTP (e-mail)
- address, are not unambiguous. Use of a distinguished name as described in
- RFC 1255 [24] is suggested.
-
- The authentication process is described in RFC 1422 [21]:
- ------------------------------
- 9. The processing rates for Sun 4/50 (40 MHz clock) and SPARCstation 10's
- (36 MHz clock) are 0.95 and 2.2 MB/s, respectively, measured for a single
- 1000-byte block. Note that timing the repeated application of the algorithm
- for the same block of data gives optimistic results since the data then
- resides in the cache.
-
- In order to provide message integrity and data origin
- authentication, the originator generates a message integrity code
- (MIC), signs (encrypts) the MIC using the private component of his
- public-key pair, and includes the resulting value in the message
- header in the MIC-Info field. The certificate of the originator
- is (optionally) included in the header in the Certificate field
- as described in RFC 1421. This is done in order to facilitate
- validation in the absence of ubiquitous directory services. Upon
- receipt of a privacy enhanced message, a recipient validates the
- originator's certificate (using the IPRA public component as the
- root of a certification path), checks to ensure that it has not
- been revoked, extracts the public component from the certificate,
- and uses that value to recover (decrypt) the MIC. The recovered
- MIC is compared against the locally calculated MIC to verify the
- integrity and data origin authenticity of the message.
-
-
- For audio/video applications with loose control, the certificate could be
- carried periodically to allow new listeners to obtain it and to achieve a
- measure of reliability.
-
- Symmetric key methods such as DES can also be used. Here, the key is
- simply prefixed to the message when computing the message digest (MIC), but
- not transmitted. The receiver has to obtain the sender's key through a
- secure channel, e.g., a PEM message. The method has the advantage that no
- encryption is involved, thus alleviating export-control concerns. It is
- used for SNMP Version 2 authentication.
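-
- A sketch of this keyed-digest computation, assuming the MD5 reference
- interface of RFC 1321 (MD5Init, MD5Update, MD5Final); the shared secret key
- is prefixed to the packet contents but never transmitted.
-
-   #include "md5.h"        /* RFC 1321 reference implementation assumed */
-
-   void compute_mic(unsigned char *key, unsigned int key_len,
-                    unsigned char *packet, unsigned int packet_len,
-                    unsigned char digest[16])
-   {
-       MD5_CTX ctx;
-
-       MD5Init(&ctx);
-       MD5Update(&ctx, key, key_len);          /* secret key, not sent */
-       MD5Update(&ctx, packet, packet_len);    /* packet contents      */
-       MD5Final(digest, &ctx);                 /* 16-byte MIC          */
-   }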
-
-
- 3.12 Security for RTP vs. PEM
-
-
- It is the author's opinion that RTP should aim to reuse as much of the
- PEM technology and syntax as possible, unless there are strong reasons in
- the nature of real-time traffic to deviate. This has the advantage that
- terminology, implementation experience, certificate mechanisms and possibly
- code can be reused. Also, since it is hoped that RTP finds use in a range
- of applications, a broad spectrum of security mechanisms should be provided,
- not necessarily limited by what is appropriate for large-distribution audio
- and video conferences.
-
- It should be noted that connection-oriented security architectures are
- probably unsuitable for RTP applications as they rely on reliable stream
- transmission and an explicit setup phase with typically only a single sender
- and receiver.
-
- There are a number of differences between the security requirements of PEM
- and RTP that should be kept in mind:
-
-
- Transparency: Unlike electronic mail, it is safe to assume that the channel
- will carry 8 bit data unaltered. Thus, a conversion to a canonical
- form or encoding binary data into a 64-element subset as done for PEM
- is not required.
-
- Time: As outlined at the beginning of this document, processing speed and
- packet overhead have to be major considerations, much more so than with
- store-and-forward electronic mail. Message digest algorithms and DES
- can be implemented sufficiently fast even in software to be used for
- voice and possibly for low-bit rate video. Even for short signatures,
- RSA encryption is fairly slow.
-
- Note that the ASN.1/BER encoding of asymmetrically-encrypted MICs and
- certificates adds no significant processing load. For the MICs, the
- ASN.1 algorithm yields only additional constant bytes which a paranoid
- program can check, but does not need to decode. Certificates are
- carried much more infrequently and are relatively simple structures.
- It would seem unnecessary to supply a complete ASN.1/BER parser for any
- of the datastructures.
-
- Space: Encryption algorithms require a minimum data input equal to their
-     key length. Thus, for the suggested key length for RSA encryption
- of 508 to 1024 bits, the 16-byte message digest expands to a 53
- to 128 byte MIC. This is clearly rather burdensome for short audio
- packets. Applying a single message digest to several packets seems
- possible if the packet loss rates are sufficiently low, even though it
- does introduce minor security risks in the case where the receiver is
- forced to decide between accepting as authentic an incomplete sequence
- of packets or rejecting the whole sequence. Note that it would
-     not be necessary to delay playback until a complete authenticated
- block has been received; in general, a warning that authentication has
- failed would be sufficient for human users. The application should
- also issue a warning if no complete block could be authenticated for
- several blocks, as that might indicate that an impostor was feigning
- the presence of MIC-protected data by strategically dropping packets.
-
- The initialization vector for DES in cipher block chaining (CBC) mode
- adds another eight bytes.
-
- Scale: The symmetric key authentication algorithm used by PEM does not
- scale well for a large number of receivers as the message has to
- contain a separate MIC for each receiver, encrypted with the key for
- that particular sender-receiver pair. If we forgo the ability to
- authenticate an individual user, a single session key shared by all
- participants can thwart impostors from outside the group holding the
- shared secret.
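-
- Following up on the block-authentication idea above, the sketch below
- computes a single MD5 digest over several packets using the reference
- interface of RFC 1321 [23]. It is not part of the RTP specification; the
- packet structure and function name are illustrative assumptions.
-
-       /* Sketch only: one digest covering a block of packets, to be
-        * protected by a single MIC.  Uses the RFC 1321 reference API;
-        * the packet layout is an assumption. */
-       #include "md5.h"    /* MD5Init, MD5Update, MD5Final (RFC 1321) */
-
-       struct packet {
-           unsigned char *data;    /* octets covered by the MIC */
-           unsigned int   len;
-       };
-
-       void digest_block(struct packet *pkts, int n,
-                         unsigned char digest[16])
-       {
-           MD5_CTX ctx;
-           int i;
-
-           MD5Init(&ctx);
-           for (i = 0; i < n; i++)
-               MD5Update(&ctx, pkts[i].data, pkts[i].len);
-           MD5Final(digest, &ctx);
-       }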
-
-
- 3.13 Quality of Service Control
-
-
- Because real-time services cannot afford retransmissions, they are directly
- affected by packet loss and delays. Delay jitter and packet loss, for
- example, provide a good indication of network congestion and may suggest
- switching to a lower-bandwidth coding. To aid in fault isolation and
- performance monitoring, quality-of-service (QOS) measurement support is
- useful. QOS monitoring is of interest to the receiver of real-time
- data, the sender of that data and possibly a third-party monitor, e.g.,
- the network provider, that is itself not part of the real-time data
- distribution.
-
-
- 3.13.1 QOS Measures
-
-
- For real-time services, a number of QOS measures are of interest, roughly in
- order of importance:
-
-
- o packet loss
-
- o packet delay variation (variance, minimum/maximum)
-
- o relative clock drift (delay between sender and receiver timestamp)
-
-
- In the following, the terms receiver and sender pertain to the real-time
- data, not any returned QOS data. If the receiver is to measure packet loss,
- an indication of the number of packets actually transmitted is required.
- If the receiver itself does not need to compute packet loss percentages,
- it is sufficient for the receiver to indicate to the sender the number of
- packets received and the range of timestamps covered, thus avoiding the need
- for sequence numbers. Translation into loss at the sender is somewhat
- complicated, however, unless restrictions on permissible timestamps (e.g.,
- those starting a synchronization unit) are enforced. If sequence numbers
- are available, the receiver has to track the number of times that the
- sequence number has wrapped around, even in the face of packet reordering.
- If c_n denotes the cycle count, M the sequence number modulus and s_n the
- sequence number of the n-th received packet, where s_n is not necessarily
- larger than s_{n-1}, we can write:
-
-         c_n = c_{n-1} + 1    for  -M  < s_n - s_{n-1} < -M/2
-         c_n = c_{n-1} - 1    for  M/2 < s_n - s_{n-1} <  M
-         c_n = c_{n-1}        otherwise
-
- For example, the sequence number sequence 65534;2;65535;1;3;5;4 would
- yield the cycle number sequence 0;1;0;1;1;1;1 for M=65536, i.e., 16-bit
- sequence numbers. The total number of expected packets is then computed
- simply as s_n + M*c_n - s_0 + 1, where the first received packet has index 0.
- The user of the measurements should also have some indication as to the time
- period they cover so that the degree of confidence in these statistical
- measurements can be established.
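-
- The wrap-around rule can be implemented directly. The following sketch (in
- C; structure and function names are illustrative, not taken from any
- specification) maintains the cycle count for M = 65536 and computes the
- expected packet count. Applied to the example sequence above, it reports
- seven expected packets.
-
-       /* Sketch only: cycle count c and expected packet count for 16-bit
-        * sequence numbers (M = 65536), following the rule given above. */
-       #include <stdint.h>
-
-       #define M 65536L
-
-       struct seq_state {
-           long     c;        /* cycle count c_n                */
-           uint16_t s_prev;   /* previous sequence number       */
-           uint16_t s0;       /* first sequence number received */
-       };
-
-       /* Call once for the first packet received. */
-       void seq_init(struct seq_state *st, uint16_t s)
-       {
-           st->c = 0;
-           st->s_prev = s;
-           st->s0 = s;
-       }
-
-       /* Call once for each subsequently received packet. */
-       void seq_update(struct seq_state *st, uint16_t s)
-       {
-           long d = (long)s - (long)st->s_prev;   /* s_n - s_{n-1} */
-
-           if (d < -M / 2)          /* wrapped around forward        */
-               st->c += 1;
-           else if (d > M / 2)      /* reordered packet from before  */
-               st->c -= 1;          /* the last wrap-around          */
-           st->s_prev = s;
-       }
-
-       /* Expected packets: s_n + M*c_n - s_0 + 1 (first packet has index 0). */
-       long seq_expected(const struct seq_state *st)
-       {
-           return (long)st->s_prev + M * st->c - (long)st->s0 + 1;
-       }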
-
-
- 3.13.2 Remote measurements
-
-
- It may be desirable for the sender, interested multicast group members
- or a non-group member (third party) to have automatic access to
- quality-of-service measurements. In particular, it is necessary for the
- sender to gather a number of reception reports from different parts of the
- Internet to ``triangulate'' where packets get lost or delayed.
-
- Two modes of operation can be distinguished: monitor-driven or
- receiver-driven. In the monitor-driven case, a site interested in QOS data
- for a particular sender contacts the receiver through a back channel and
- requests a reception report. Alternatively, each site can send reception
- reports to a monitoring multicast group or as session data, along with
- the ``regular station identification'' to the same multicast group used
- for data. The first approach requires the most implementation effort,
- but produces the least amount of data. The other two approaches have
- complementary properties.
-
- In most cases, sender-specific quality of service information is more useful
- for tracking network problems than aggregate data for all senders. Since
- a site cannot transmit reception reports for all senders it has ever heard
- from, some selection mechanism is needed, such as most-recently-heard or
- cycling through sites.
-
- Source identification poses some difficulties since the network address seen
- by the receiver may not be meaningful to other members of the multicast
- group, e.g., after IP-SIP address translation. On the other hand, network
- addresses are easier to correlate with other network-level tools such as
- those used for Mbone mapping.
-
- For delay, the receiver can report the minimum and maximum difference
- between departure and arrival timestamp.
- This has the advantage that the fixed delay can also be estimated if
- sender and receiver clocks are known to be synchronized. Unfortunately,
- delay extrema are noisy measurements that give only a limited indication of
- the delay variability. The receiver could also return the playout delay
- value it uses, although for absolute timing, that again depends on the
- clock differential, as well as on the particular delay estimation algorithm
- employed by the receiver. In summary, a minimal set of useful measurements
- appears to be the expected and received packet count, combined with the
- minimum and maximum timestamp difference.
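-
- One possible layout for such a report is sketched below; the field names
- and widths are assumptions for illustration only and are not taken from
- the RTP or RTCP specification.
-
-       /* Sketch only: a reception report carrying the minimal measurement
-        * set named above.  Field names and sizes are assumptions. */
-       #include <stdint.h>
-
-       struct reception_report {
-           uint32_t source;        /* sender this report refers to        */
-           uint32_t expected;      /* packets expected (sequence range)   */
-           uint32_t received;      /* packets actually received           */
-           uint32_t min_transit;   /* minimum arrival-departure timestamp */
-           uint32_t max_transit;   /* maximum arrival-departure timestamp */
-       };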
-
-
- 3.13.3 Monitoring by Third Party
-
-
- Except for delay estimates based on sequence number ranges, the above
- section applies to this case as well.
-
-
- 4 Conference Control Protocol
-
-
- Currently, only conference control functions used for loosely controlled
- conferences (open admission, no explicit conference set-up) have been
- considered in depth. Support for the following functionality needs to be
- specified:
-
-
- o authentication
-
- o floor control, token passing
-
- o invitations, calls
-
- o call forwarding, call transfer
-
- o discovery of conferences and resources (directory service)
-
- o media, encoding and quality-of-service negotiation
-
- o voting
-
- o conference scheduling
-
- o user locator
-
-
- The functional specification of a conference control protocol is beyond the
- scope of this memorandum.
-
-
- 5 The Use of Profiles
-
-
- RTP is intended to be a rather 'thin' protocol, partially because it aims
- to serve a wide variety of real-time services. The RTP specification
- intentionally leaves a number of issues open for other documents (profiles),
- which in turn have the goal of making it easy to build interoperable
- applications for a particular application domain, for example, audio and
- video conferences.
-
- Some of the issues that a profile should address include:
-
-
- o the interpretation of the 'content' field with the CDESC option
-
- o the structure of the content-specific part at the end of the CDESC
- option
-
- o the mechanism by which applications learn about and define the mapping
- between the 'content' field in the RTP fixed header and its meaning
-
- o the use of the optional framing field prefixed to RTP packets (not
- used, used only if underlying transport protocol does not provide
- framing, used by some negotiation mechanism, always used)
-
- o any RTP-over-x issues, that is, definitions needed to allow RTP to use
- a particular underlying protocol
-
- o content-specific RTP, RTCP or reverse control options
-
- o port assignments for data and reverse control
-
-
- 6 Port Assignment
-
-
- Since it is anticipated that UDP and similar port-oriented protocols will
- play a major role in carrying RTP traffic, the issue of port assignment
- needs to be addressed. The way ports are assigned mainly affects how
- applications can extract the packets destined for them. For each medium,
- there also needs to be a mechanism for distinguishing data from control
- packets.
-
- For unicast UDP, only the port number is available for demultiplexing.
- Thus, each medium will need a separate port number pair unless a separate
- demultiplexing agent is used. However, for one-to-one connections,
- dynamically negotiating a port number is easy. If several UDP streams are
- used to provide multicast by transport-level replication, the port number
- issue becomes somewhat more difficult. For ST-II, a common port number has
- to be agreed upon by all participants, which may be difficult particularly
- if a new site wants to join an on-going connection, but is already using the
- port number in a different connection.
-
- For UDP multicast, an application can elect to receive only packets with a
- particular port number and multicast address by binding to the appropriate
- multicast address(10) . Thus, for UDP multicast, there is no need to
- distinguish media by port numbers, as each medium could have its designated
- and unique multicast group. Any dynamic port allocation mechanism would
- fail for large, dynamic multicast groups, but might be appropriate for small
- conferences and two-party conversations.
-
- ------------------------------
- 10. This extension to the original multicast socket semantics is currently
- in the process of being deployed.
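-
- The group/port binding described above might look like the following
- sketch, which binds a UDP socket to one multicast group and port and then
- joins the group. It relies on the extended socket semantics mentioned in
- footnote 10; the group, port and error handling are placeholders.
-
-       /* Sketch only: receive one medium's packets by binding to its
-        * multicast group and port and joining the group. */
-       #include <arpa/inet.h>
-       #include <netinet/in.h>
-       #include <string.h>
-       #include <sys/socket.h>
-       #include <unistd.h>
-
-       int open_multicast_rx(const char *group, unsigned short port)
-       {
-           int fd = socket(AF_INET, SOCK_DGRAM, 0);
-           struct sockaddr_in addr;
-           struct ip_mreq mreq;
-
-           if (fd < 0)
-               return -1;
-           memset(&addr, 0, sizeof addr);
-           addr.sin_family = AF_INET;
-           addr.sin_port = htons(port);
-           /* Binding to the group address (where supported) filters out
-            * packets sent to other groups on the same port. */
-           addr.sin_addr.s_addr = inet_addr(group);
-           if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
-               close(fd);
-               return -1;
-           }
-           mreq.imr_multiaddr.s_addr = inet_addr(group);
-           mreq.imr_interface.s_addr = htonl(INADDR_ANY);
-           if (setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP,
-                          &mreq, sizeof mreq) < 0) {
-               close(fd);
-               return -1;
-           }
-           return fd;
-       }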
-
- Data and control packets for a single medium can either share a single
- port or use two different port numbers. (Currently, two adjacent port
- numbers, 3456 and 3457, are used.) A single port for data and control
- simplifies the receiver code and translators and, less importantly, conserves
- port numbers. With the proliferation of firewalls, limiting the number of
- ports has assumed additional importance. Sharing a single port requires
- some other means of identifying control packets, for example a reserved
- encoding value. Alternatively, all control data could be carried as options
- within data packets, akin to the NVP protocol options. Since control
- messages are also transmitted if no actual medium data is available, header
- content of packets without media data needs to be determined. With the use
- of a synchronization bit, the issue of how sequence numbers and timestamps
- are to be treated for these packets is less critical. It is suggested to
- use a zero timestamp and to increment the sequence number normally. Due to
- the low bandwidth requirements of typical control information, the issue of
- accommodating control information in any bandwidth reservation scheme should
- be manageable. The penalty paid is the eight-byte overhead of the RTP
- header for control packets that do not require time stamps, encoding and
- sequence number information.
-
- Using a single RTCP stream for several media may be advantageous to
- avoid duplicating, for example, the same identification information for
- voice, video and whiteboard streams. This works only if there is one
- multicast group that all members of a conference subscribe to. Given
- the relatively low frequency of control messages, the coordination effort
- between applications and the necessity to designate control messages for a
- particular medium are probably reasons enough to have each application send
- control messages to the same multicast group as the data.
-
- In conclusion, for multicast UDP, one assigned port number, for both data
- and control, seems to offer the most advantages, although the data/control
- split may offer some bandwidth savings.
-
-
- 7 Multicast Address Allocation
-
-
- A fixed, permanent allocation of network multicast addresses to individual
- conferences by some naming authority such as the Internet Assigned Numbers
- Authority is clearly not feasible, since the lifetime of conferences is
- unknown, the potential number of conferences is rather large and the
- 28 16
- available number space limited to about 2 , of which 2 have been set
- aside for dynamic allocation by conferences.
-
- The alternative to permanent allocation is a dynamic allocation, where an
- initiator of a multicast application obtains an unused multicast address in
- some manner (discussed below). The address is then made available again,
- either implicitly or explicitly, as the application terminates.
-
-
- The address allocation may or may not be handled by the same mechanism that
- provides conference naming and discovery services. Separating the two has
- the advantage that dynamic (multicast) address allocation may be useful
- to applications other than conferencing. Also, different mechanisms (for
- example, periodic announcements vs. servers) may be appropriate for each.
-
- We can distinguish two methods of multicast address assignment:
-
-
- function-based: all applications of a certain type share a common, global
- address space. Currently, a reservation of a 16-bit address space for
- conferences is one example. The advantage of this scheme is that
- directory functions and allocation can be readily combined, as is done
- in the sd tool by Van Jacobson. A single namespace spanning the
- globe makes it necessary to restrict the scope of addresses so that
- allocation does not require knowing about and distributing information
- about the existence of all global conferences.
-
- hierarchical: Based on the location of the initiator, only a subset of
- addresses are available. This limits the number of hosts that
- could be involved in resolving collisions, but, like most hierarchical
- assignment, leads to sparse allocation. Allocation is independent of
- the function the address is used for.
-
-
- Clearly, combinations are possible, for example, each local namespace could
- be functionally divided if sufficiently large. With the current allocation
- 16
- of 2 addresses to conferences, hierarchical division except on a very
- coarse scale is not feasible.
-
- To a limited extent, multicast address allocation can be compared to the
- well-known channel multiple access problem. The multicast address space
- plays the role of the common channel, with each address representing a time
- slot.
-
- All the following schemes require cooperation from all potential users of
- the address space. There is no protection against an ignorant or malicious
- user joining a multicast group.
-
-
- 7.1 Channel Sensing
-
-
- In this approach, the initiator randomly selects a multicast address from a
- given range, joins the multicast group with that address and listens whether
- some other host is already transmitting on that address. This approach does
- not require a separate address allocation protocol or an address server,
- but it is probably infeasible for a number of reasons. First, a user
- process can only bind to a single port at one time, making 'channel sensing'
- difficult. Secondly, unlike listening to a typical broadcast channel, the
- act of joining the multicast group can be quite expensive both for the
- listening host and the network. Consider what would happen if a host
- attached through a low-bandwidth connection joins a multicast group carrying
- video traffic, say.
-
- Channel sensing may also fail if two sections of the network that were
- separated at the time of address allocation rejoin later. Changes in
- time-to-live values can make multicast groups 'visible' to hosts that
- previously were outside their scope.
-
-
- 7.2 Global Reservation Channel with Scoping
-
-
- Each range of multicast addresses has an associated well-known multicast
- address and port where all initiators (and possibly users) advertise the use
- of multicast addresses. An initiator first picks a multicast address at
- random, avoiding those already known to be in use. Some mechanism for
- collision resolution has to be provided in the unlikely event that two
- initiators simultaneously choose the same address. Also, since address
- advertisement will have to be sent at fairly long intervals to keep traffic
- down, an application wanting to start a conference, for example, has to
- wait for an extended period of time unless it continuously monitors the
- allocation multicast group.
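-
- The initial selection step might look like the sketch below; the dynamic
- address range and the in-use test are placeholder assumptions, and the
- subsequent advertisement and collision resolution are not shown.
-
-       /* Sketch only: pick a candidate multicast address at random,
-        * avoiding addresses already advertised on the reservation
-        * channel.  Range and in-use test are assumptions. */
-       #include <stdint.h>
-       #include <stdlib.h>
-
-       /* Learned by listening to the reservation multicast group. */
-       extern int address_in_use(uint32_t addr);
-
-       uint32_t pick_conference_address(void)
-       {
-           /* hypothetical 16-bit dynamic range, host byte order */
-           const uint32_t base = ((uint32_t)224 << 24) | ((uint32_t)2 << 16);
-           uint32_t addr;
-
-           do {
-               addr = base | (uint32_t)(random() & 0xffff);
-           } while (address_in_use(addr));
-           return addr;
-       }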
-
- To limit traffic, it may seem advisable to only have the initiator multicast
- the address usage advertisement. This, however, means that there needs to
- be a mechanism for another site to take over advertising the group if the
- initiator leaves, but the multicast group continues to exist. Time-to-live
- restrictions pose another problem. If only a single source advertises the
- group, the advertisement may not reach all those sites that could be reached
- by the multicast transmissions themselves.
-
- The possibility of collisions can be reduced by address reuse with scoping,
- discussed further below, and by adding port numbers and other identifiers
- as further discriminators. The latter approach appears to defeat the
- purpose of using multicast to avoid transmitting information to hosts that
- have no interest in receiving it. Routers can only filter based on group
- membership, not ports or other higher-layer demultiplexing identifiers.
- Thus, even though two conferences with the same multicast address and
- different ports, say, could coexist at the application layer, this would
- force hosts and networks that are interested in only one of the conferences
- to deal with the combined traffic of the two conferences.
-
-
- 7.3 Local Reservation Channel
-
-
- Instead of sharing a global namespace for each application, this scheme
- divides the multicast address space hierarchically, allowing an initiator
- within a given network to choose from a smaller set of multicast addresses,
- but independent of the application. As with many allocation problems, we
- can devise both server-based and fully distributed versions.
-
-
- 7.3.1 Hierarchical Allocation with Servers
-
-
- By some external means, address servers, distributed throughout the network,
- are provided with non-overlapping regions of the multicast address space.
- An initiator asks its favorite address server for an address when needed.
- When it no longer needs the address, it returns it to the server. To
- prevent addresses from disappearing when the requestor crashes and loses
- its memory about allocated addresses, requests should have an associated
- time-out period. This would also (to some extent) cover the case that the
- initiator leaves the conference, without the conference itself disbanding.
- To decrease the chances that an initiator cannot be provided with an
- address, either the local server could 'borrow' an address from another
- server or could point the initiator to another server, somewhat akin to the
- methods used by the Domain Name Service (DNS). Provisions have to be made
- for servers that crash and may lose knowledge about the status of their block
- of addresses, in particular their expiration times. The impact of such
- failures could be mitigated by limiting the maximum expiration time to a few
- hours. Also, the server could try to request status by multicast from its
- clients.
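-
- The time-out mechanism amounts to a lease table on the server, as in the
- sketch below; the block size, the maximum lease and all names are
- assumptions for illustration.
-
-       /* Sketch only: lease-based allocation from a server's address
-        * block, with time-outs so that addresses return even if the
-        * requestor crashes. */
-       #include <stdint.h>
-       #include <time.h>
-
-       #define BLOCK_SIZE 256          /* addresses managed by this server */
-       #define MAX_LEASE  (4 * 3600L)  /* bound on lease length in seconds */
-
-       static uint32_t block_base;              /* first address of block  */
-       static time_t   expires[BLOCK_SIZE];     /* 0 means address is free */
-
-       /* Returns an address from the block, or 0 if all are leased. */
-       uint32_t grant_address(long requested_lease)
-       {
-           time_t now = time(NULL);
-           long lease = requested_lease > MAX_LEASE ? MAX_LEASE
-                                                    : requested_lease;
-           int i;
-
-           for (i = 0; i < BLOCK_SIZE; i++) {
-               if (expires[i] != 0 && expires[i] <= now)
-                   expires[i] = 0;               /* lease ran out: reclaim */
-               if (expires[i] == 0) {
-                   expires[i] = now + lease;
-                   return block_base + (uint32_t)i;
-               }
-           }
-           return 0;                             /* block exhausted */
-       }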
-
-
- 7.3.2 Distributed Hierarchical Allocation
-
-
- Instead of a server, each network is allocated a set of multicast
- addresses. Within the current IP address space, class A, B and C
- networks would get roughly 120 addresses, taking into account those that
- have been permanently assigned. Contention for addresses works like the
- global reservation channel discussed earlier, but the reservation group is
- strictly limited to the local network. (Since the address ranges are
- disjoint, address information that inadvertently leaks outside the network
- is harmless.)
-
- This method avoids the use of servers and the attendant failure modes, but
- introduces other problems. The division of the address space leads to a
- barely adequate supply of addresses (although larger address formats will
- probably make that less of an issue in the future). As for any distributed
- algorithm, splitting of networks into temporarily unconnected parts can
- easily destroy the uniqueness of addresses. Handling initiators that leave
- on-going conferences is probably the most difficult issue.
-
-
- 7.4 Restricting Scope by Limiting Time-to-Live
-
-
- Regardless of the address allocation method, it may be desirable to
- distinguish multicast addresses with different reach. A local address would
- be given out with the restriction of a maximum time-to-live value and could
- thus be reused at a network sufficiently removed, akin to the combination
- of cell reuse and power limitation in cellular telephony. Given that many
- conferences will be local or regional (e.g., broadcasting classes to nearby
- campuses of the same university or a regional group of universities, or an
- electronic town meeting), this should allow significant reuse of addresses.
- Reuse of addresses, however, requires careful engineering of thresholds and would
- probably only be useful for very small time-to-live values that restrict
- reach to a single local area network. Using time-to-live fields to restrict
- scope rather than just prevent looping introduces difficult-to-diagnose
- failure modes into multicast sessions. In particular, reachability is no
- longer transitive, as B may have A and C in its scope, but A and C may be
- outside each other's scope (or A may be in the scope of B, but not vice
- versa, due to asymmetric routes, etc.). This problem is aggravated by the
- fact that routers (for obvious reasons) are not supposed to return ICMP time
- exceeded messages, so that the sender can only guess why multicast packets
- do not reach certain receivers.
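-
- On the sending side, restricting the scope amounts to setting the
- multicast time-to-live on the socket, as in the minimal sketch below; the
- threshold value associated with a 'local' address would be a placeholder
- chosen by the address allocation.
-
-       /* Sketch only: limit the reach of outgoing conference traffic by
-        * setting the multicast time-to-live on the sending socket. */
-       #include <netinet/in.h>
-       #include <sys/socket.h>
-
-       int set_multicast_scope(int fd, unsigned char ttl)
-       {
-           return setsockopt(fd, IPPROTO_IP, IP_MULTICAST_TTL,
-                             &ttl, sizeof ttl);
-       }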
-
-
- 8 Security Considerations
-
-
- Security issues are discussed in Section 3.11.
-
-
- Acknowledgments
-
-
- This draft is based on discussion within the AVT working group chaired by
- Stephen Casner. Eve Schooler and Stephen Casner provided valuable comments.
-
- This work was supported in part by the Office of Naval Research under
- contract N00014-90-J-1293, the Defense Advanced Research Projects Agency
- under contract NAG2-578 and a National Science Foundation equipment grant,
- CERDCR 8500332.
-
-
- A Glossary
-
-
- The glossary below briefly defines the acronyms used within the text.
- Further definitions can be found in RFC 1392, ``Internet User's Glossary''.
- Some of the general Internet definitions below are copied from that
- glossary. The quoted passages followed by a reference of the form
- ``(G.701)'' are drawn from the CCITT Blue Book, Fascicle I.3, Definitions.
- The glossary of the document ``Recommended Practices for Enhancing Digital
- Audio Compatibility in Multimedia Systems'', published by the Interactive
- Multimedia Association was used for some terms marked with [IMA]. The
- section on MPEG is based on text written by Mark Adler (Caltech).
-
-
- 4:1:1 Refers to degree of subsampling of the two chrominance signals with
- respect to the luminance signal. Here, each color difference component
- has one quarter the resolution of the luminance component.
-
- 4:2:2 Refers to degree of subsampling of the two chrominance signals with
- respect to the luminance signal. Here, each color difference component
- has half the resolution of the luminance component.
-
- 16/16 timestamp: a 32-bit integer timestamp consisting of a 16-bit field
- containing the number of seconds followed by a 16-bit field containing
- the binary fraction of a second. This timestamp can measure about 18.2
- hours with a resolution of approximately 15 microseconds.
-
- n/m timestamp: an n+m bit timestamp consisting of an n-bit second count and
- an m-bit fraction.
-
- ADPCM: Adaptive differential pulse code modulation. Rather than
- transmitting ! PCM samples directly, the difference between the
- estimate of the next sample and the actual sample is transmitted. This
- difference is usually small and can thus be encoded in fewer bits than
- the sample itself. The ! CCITT recommendations G.721, G.723, G.726
- and G.727 describe ADPCM encodings. ``A form of differential pulse
- code modulation that uses adaptive quantizing. The predictor may be
- either fixed (time invariant) or variable. When the predictor is
- adaptive, the adaptation of its coefficients is made from the quantized
- difference signal.'' (G.701)
-
- adaptive quantizing: ``Quantizing in which some parameters are made
- variable according to the short term statistical characteristics of the
- quantized signal.'' (G.701)
-
- A-law: a type of audio !companding popular in Europe.
-
- CCIR: Comite Consultatif International de Radio. This organization is
- part of the United Nations International Telecommunications Union (ITU)
- and is responsible for making technical recommendations about radio,
- television and frequency assignments. The CCIR has recently changed
- its name to ITU-TR; we maintain the more familiar name. !CCITT
-
- CCIR-601: The CCIR-601 digital television standard is the base for all the
- subsampled interchange formats such as SIF, CIF, QCIF, etc. For NTSC
- (PAL/SECAM), it is 720 (720) pixels by 243 (288) lines by 60 (50)
- fields per second, where the fields are interlaced when displayed.
- The chrominance channels are horizontally subsampled by a factor of two,
- yielding 360 (360) pixels by 243 (288) lines by 60 (50) fields a
- second.
-
- CCITT: Comite Consultatif International Telegraphique et Telephonique
- (CCITT). This organization is part of the United Nations International
- Telecommunications Union (ITU) and is responsible for making technical
- recommendations about telephone and data communications systems. X.25
- is an example of a CCITT recommendation. Every four years CCITT holds
- plenary sessions where they adopt new recommendations. Recommendations
- are known by the color of the cover of the book they are contained in.
- (The 1988 edition is known as the Blue Book.) The CCITT has recently
- changed its name to ITU-TS; we maintain the familiar name. !CCIR
-
- CELP: code-excited linear prediction; audio encoding method for low-bit
- rate codecs; !LPC.
-
- CD: compact disc.
-
- chrominance: color information in a video image. For !H.261, color is
- encoded as two color differences: CR (``red'') and CB (``blue'').
- !luminance
-
- CIF: common interchange format; interchange format for video images with
- 288 lines with 352 pixels per line of luminance and 144 lines with 176
- pixels per line of chrominance information. !QCIF, SCIF
-
- CLNP: ISO connectionless network-layer protocol (ISO 8473), similar in
- functionality to !IP.
-
- codec: short for coder/decoder; device or software that ! encodes and
- decodes audio or video information.
-
- companding: contraction of compressing and expanding; reducing the dynamic
- range of audio or video by a non-linear transformation of the sample
- values. The best known methods for audio are mu-law, used in North
- America, and A-law, used in Europe and Asia. !G.711 For a given
- number of bits, companded data uses a greater number of binary codes to
- represent small signal levels than linear data, resulting in a greater
- dynamic range at the expense of a poorer signal-to-noise ratio. [25]
-
- DAT: digital audio tape.
-
- decimation: reduction of sample rate by removal of samples [IMA].
-
- delay jitter: Delay jitter is the variation in end-to-end network delay,
- caused principally by varying media access delays, e.g., in an
- Ethernet, and queueing delays. Delay jitter needs to be compensated
- by adding a variable delay (referred to as ! playout delay) at the
- receiver.
-
- DVI: (trademark) digital video interactive. Audio/video compression
- technology developed by Intel's DVI group. [IMA]
-
- dynamic range: a ratio of the largest encodable audio signal to the
- smallest encodable signal, expressed in decibels. For linear audio
- data types, the dynamic range is approximately six times the number of
- bits, measured in dB.
-
- encoding: transformation of the media content for transmission, usually to
- save bandwidth, but also to decrease the effect of transmission errors.
- Well-known encodings are G.711 (mu-law PCM), and ADPCM for audio, JPEG
- and MPEG for video. ! encryption
-
- encryption: transformation of the media content to ensure that only the
- intended recipients can make use of the information. ! encoding
-
- end system: host where conference participants are located. RTP packets
- received by an end system are played out, but not forwarded to other
- hosts (in a manner visible to RTP).
-
- FIR: finite (duration) impulse response. A signal processing filter that
- does not use any feedback components [IMA].
-
- frame: unit of information. Commonly used for video to refer to a single
- picture. For audio, it refers to the data that forms an encoding unit.
- For example, an LPC frame consists of the coefficients necessary to
- generate a specific number of audio samples.
-
- frequency response: a system's ability to encode the spectral content of
- audio data. The sample rate has to be at least twice as large as the
- maximum possible signal frequency.
-
- G.711: ! CCITT recommendation for ! PCM audio encoding at 64 kb/s using
- mu-law or A-law companding.
-
- G.721: ! CCITT recommendation for 32 kbit/s adaptive differential pulse
- code modulation (! ADPCM, PCM).
-
- G.722: ! CCITT recommendation for audio coding at 64 kbit/s; the audio
- bandwidth is 7 kHz instead of 3.5 kHz for G.711, G.721, G.723 and
- G.728.
-
- G.723: ! CCITT recommendation for extensions of Recommendation G.721
- adapted to 24 and 40 kbit/s for digital circuit multiplication
- equipment.
-
- G.728: ! CCITT recommendation for voice coding using code-excited linear
- prediction (CELP) at 16 kbit/s.
-
- G.764: ! CCITT recommendation for packet voice; specifies both ! HDLC-like
- data link and network layer. In the draft stage, this standard was
- referred to as G.PVNP. The standard is primarily geared towards digital
- circuit multiplication equipment used by telephone companies to carry
- more voice calls on transoceanic links.
-
- G.821: ! CCITT recommendation for the error performance of an
- international digital connection forming part of an integrated services
- digital network.
-
- G.822: ! CCITT recommendation for the controlled !slip rate objective on
- an international digital connection.
-
- G.PVNP: designation of CCITT recommendation ! G.764 while in draft status.
-
- GOB: (H.261) groups of blocks; a !CIF picture is divided into 12 GOBs, a
- QCIF into 3 GOBs. A GOB is composed of 33 macroblocks (!MB) and
- contains luminance and chrominance information for 8448 pixels.
-
- GSM: Groupe Special Mobile. In general, designation for the European mobile
- telephony standard. In particular, often used to denote the audio
- coding used. Formally known as the European GSM 06.10 provisional
- standard for full-rate speech transcoding, prI-ETS 300 036. It uses
- RPE/LTP (regular pulse excitation/long-term prediction) at 13 kb/s
- using frames of 160 samples covering 20 ms.
-
- H.261: ! CCITT recommendation for the compression of motion video at rates
- of p x 64 kb/s (where p = 1, ..., 30). Originally intended for narrowband
- !ISDN.
-
- hangover: [26] Audio data transmitted after the silence detector indicates
- that no audio data is present. Hangover ensures that the ends of
- words, important for comprehension, are transmitted even though they
- are often of low energy.
-
- HDLC: high-level data link control; standard data link layer protocol
- (closely related to LAPD and SDLC).
-
- IMA: Interactive Multimedia Association; trade association located in
- Annapolis, MD.
-
- ICMP: Internet Control Message Protocol; ICMP is an extension to the
- Internet Protocol. It allows for the generation of error messages,
- test packets and informational messages related to ! IP.
-
- in-band: signaling information is carried together (in the same channel or
- packet) with the actual data. ! out-of-band.
-
- interpolation: increase in sample rate by introduction of processed
- samples.
-
- IP: internet protocol; the Internet Protocol, defined in RFC 791, is the
- network layer for the TCP/IP Protocol Suite. It is a connectionless,
- best-effort packet switching protocol [27].
-
- IP address: four-byte binary host interface identifier used by !IP for
- addressing. An IP address consists of a network portion and a
- host portion. RTP treats IP addresses as globally unique, opaque
- identifiers.
-
- IPv4: current version (4) of ! IP.
-
- ISDN: integrated services digital network; refers to an end-to-end circuit
- switched digital network intended to replace the current telephone
- network. ISDN offers circuit-switched bandwidth in multiples of 64
- kb/s (B or bearer channel), plus a 16 kb/s packet-switched data (D)
- channel.
-
- ISO: International Standards Organization. A voluntary, nontreaty
- organization founded in 1946. Its members are the national
- standards organizations of the 89 member countries, including ANSI
- for the U.S. (Tanenbaum)
-
- ISO 10646: !ISO standard for the encoding of characters from all languages
- into a single 32-bit code space (Universal Character Set). For
- transmission and storage, a one-to-five octet code (UTF) has been
- defined which is upwardly compatible with US-ASCII.
-
- JPEG: ISO/CCITT joint photographic experts group. Designation of a
- variable-rate compression algorithm using discrete cosine transforms
- for still-frame color images.
-
- jitter: ! delay jitter.
-
- linear encoding: a mapping from signal values to binary codes where each
- binary level represents the same signal increment !companding.
-
- loosely controlled conference: Participants can join and leave the
- conference without connection establishment or notifying a conference
- moderator. The identity of conference participants may or may not be
- known to other participants. See also: tightly controlled conference.
-
- low-pass filter: a signal processing function that removes spectral content
- above a cutoff frequency. [IMA]
-
- LPC: linear predictive coder. Audio encoding method that models speech as
- the parameters of a linear filter; used for very low bit rate codecs.
-
- luminance: brightness information in a video image. For black-and-
- white (grayscale) images, only luminance information is required.
- !chrominance
-
- MB: (H.261) macroblock, consisting of six blocks, four eight-by-eight
- luminance blocks and two chrominance blocks.
-
- MPEG: ISO/CCITT motion picture experts group JTC1/SC29/WG11. Designates a
- variable-rate compression algorithm for full motion video at low bit
- rates; uses both intraframe and interframe coding. It defines a bit
- stream for compressed video and audio optimized to fit into a bandwidth
- (data rate) of 1.5 Mbits/s. This rate is special because it is the
- data rate of (uncompressed) audio CD's and DAT's. The draft is in
- three parts, video, audio, and systems, where the last part gives the
- integration of the audio and video streams with the proper timestamping
- to allow synchronization of the two. MPEG phase II is to define a
- bitstream for video and audio coded at around 3 to 10 Mbits/s.
-
- MPEG compresses YUV SIF images. Motion is predicted from frame to
- frame, while DCTs of the difference signal with quantization make use
- of spatial redundancy. DCTs are performed on 8 by 8 blocks, the motion
- prediction on 16 by 16 blocks of the luminance signal. Quantization
- changes for every 16 by 16 macroblock.
-
- There are three types of coded frames. Intra (``I'') frames are coded
- without motion prediction, Predicted (``P'') frames are difference
- frames to the last P or I frame. Each macroblock in a P frame can
- either come with a vector and difference DCT coefficients for a close
- match in the last I or P frame, or it can just be intra coded (like
- in the I frames) if there was no good match. Lastly, there are "B"
- or bidirectional frames. They are predicted from the closest two I or
- P frames, one in the past and one in the future. Those frames are
- searched for matching blocks, and three different predictions are tried
- to see which works best: the forward vector, the backward vector, and
- the average of the two blocks from the future and past frames; the
- chosen prediction is subtracted from the block being coded. If none of
- them works well, the block is intra-coded.
-
- There are 12 frames from I to I, based on random access requirements.
-
- MPEG-1: Informal name of proposed !MPEG (ISO standard DIS 1172).
-
- media source: entity (user and host) that produced the media content.
- It is the entity that is shown as the active participant by the
- application.
-
- MTU: maximum transmission unit; the largest frame length which may be sent
- on a physical medium.
-
- Nevot: network voice terminal; application written by the author.
-
- network source: entity denoted by address and port number from which the !
- end system receives the RTP packet and to which the end system sends any
- RTP packets for that conference in return.
-
- NTP timestamp: ``NTP timestamps are represented as a 64-bit unsigned
- fixed-point number, in seconds relative to 0 hours on 1 January 1900.
- The integer part is in the first 32 bits and the fraction part in the
- last 32 bits.'' [13] NTP timestamps do not include leap seconds, i.e.,
- each and every day contains exactly 86,400 NTP seconds.
-
- NVP: network voice protocol; original packet format used in early packet
- voice experiments; defined in [1].
-
- octet: An octet is an 8-bit datum, which may contain values 0 through 255
- decimal. Commonly used in ISO and CCITT documents, also known as a
- byte.
-
- OSI: Open System Interconnection; a suite of protocols, designed by
- ISO committees, to be the international standard computer network
- architecture.
-
- out of band: signaling and control information is carried in a separate
- channel or separate packets from the actual data. For example, ICMP
- carries control information out-of-band, that is, as separate packets,
- for IP, but both ICMP and IP usually use the same communication channel
- (in band).
-
- parametric coder: coder that encodes parameters of a model representing the
- input signal. For example, LPC models a voice source as segments of
- voiced and unvoiced speech, represented by filter parameters. Examples
- include LPC, CELP and GSM. !waveform coder.
-
- PCM: pulse-code modulation; speech coding where speech is represented by a
- given number of fixed-width samples per second. Often used for the
- coding employed in the telephone network: 64,000 eight-bit samples per
- second.
-
- pel, pixel: picture element. ``Smallest graphic element that can be
- independently addressed within a picture; (an alternative term for
- raster graphics element).'' (T.411)
-
- playout: Delivery of the medium content to the final consumer within the
- receiving host. For audio, this implies digital-to-analog conversion,
- for video, display on a screen.
-
- playout unit: A playout unit is a group of packets sharing a common
- timestamp. (Naturally, packets whose timestamps are identical due
- to timestamp wrap-around are not considered part of the same playout
- unit.) For voice, the playout unit would typically be a single voice
- segment, while for video a video frame could be broken down into
- subframes, each consisting of packets sharing the same timestamp and
- ordered by some form of sequence number. !synchronization unit
-
- plesiochronous: ``The essential characteristic of time-scales or signals
- such that their corresponding significant instants occur at nominally
- the same rate, any variation in rate being constrained within specified
- limits. Two signals having the same nominal digit rate, but not
- stemming from the same clock or homochronous clocks, are usually
- plesiochronous. There is no limit to the time relationship between
- corresponding significant instants.'' (G.701, Q.9) In other words,
- plesiochronous clocks have (almost) the same rate, but possibly
- different phase.
-
- pulse code modulation (PCM): ``A process in which a signal is sampled, and
- each sample is quantized independently of other samples and converted
- by encoding to a digital signal.'' (G.701)
-
- PVP: packet video protocol; extension of ! NVP to video data [28]
-
- QCIF: quarter common interchange format; format for exchanging video images
- with half as many lines and half as many pixels per line as CIF, i.e.,
- luminance information is coded at 144 lines and 176 pixels per line.
- !CIF, SIF
-
- RTCP: real-time control protocol; adjunct to ! RTP.
-
- RTP: real-time transport protocol; discussed in this memorandum.
-
- sampling rate: ``The number of samples taken of a signal per unit time.''
- (G.701)
-
- SB: subband; as in subband codec. Audio or video encoding that splits the
- frequency content of a signal into several bands and encodes each band
- separately, with the encoding fidelity matched to human perception for
- that particular frequency band.
-
- SCIF: standard video interchange format; consists of four !CIF images
- arranged in a square. !CIF, QCIF
-
- SIF: standard interchange format; format for exchanging video images of 240
- lines with 352 pixels each for NTSC, and 288 lines by 352 pixels for
- PAL and SECAM. At the nominal field rates of 60 and 50 fields/s, the
- two formats have the same data rate. !CIF, QCIF
-
- slip: In digital communications, slip refers to bit errors caused by the
- different clock rates of nominally synchronous sender and receiver. If
- the sender clock is faster than the receiver clock, occasionally a bit
- will have to be dropped. Conversely, a faster receiver will need to
- insert extra bits. The problem also occurs if the clock rates of
- encoder and decoder are not matched precisely. Information loss can be
- avoided if the duration of pauses (silence periods between talkspurts
- or the inter-frame duration) can be adjusted by the receiver. ``The
- repetition or deletion of a block of bits in a synchronous or
- plesiochronous bit stream due to a discrepancy in the read and write
- rates at a buffer.'' (G.810) !G.821, G.822
-
- ST-II: stream protocol; connection-oriented unreliable, non-sequenced
- packet-oriented network and transport protocol with process demulti-
- plexing and provisions for establishing flow parameters for resource
- control; defined in RFC 1190 [29,30].
-
- Super CIF: video format defined in Annex IV of !H.261 (1992), comprising
- 704 by 576 pixels.
-
- synchronization unit: A synchronization unit consists of one or more
- !playout units that, as a group, share a common fixed delay between
- generation and playout of each part of the group. The delay may change
- at the beginning of such a synchronization unit. The most common
- synchronization units are talkspurts for voice and frames for video
- transmission.
-
- TCP: transmission control protocol; an Internet Standard transport layer
- protocol defined in RFC 793. It is connection-oriented and
- stream-oriented, as opposed to UDP [31].
-
- TPDU: transport protocol data unit.
-
- tightly controlled conference: Participants can join the conference only
- after an invitation from a conference moderator. The identity of all
- conference participants is known to the moderator. !loosely controlled
- conference.
-
- transcoder: device or application that translates between several
- encodings, for example between ! LPC and ! PCM.
-
- UDP: user datagram protocol; unreliable, non-sequenced connectionless
- transport protocol defined in RFC 768 [32].
-
- vat: visual audio tool written by Steve McCanne and Van Jacobson, Lawrence
- Berkeley Laboratory.
-
- vt: voice terminal software written at the Information Sciences Institute.
-
- VMTP: Versatile message transaction protocol; defined in RFC 1045 [33].
-
- waveform coder: a coder that tries to reproduce the waveform after
- decompression; examples include PCM and ADPCM for audio, and
- discrete-cosine-transform based coders for video; !parametric coder.
-
- Y: Common abbreviation for the luminance or luma signal.
-
- YCbCr: YCbCr coding is employed by D-1 component video equipment.
-
-
- B Address of Author
-
-
- Henning Schulzrinne
- AT&T Bell Laboratories
- MH 2A244
- 600 Mountain Avenue
- Murray Hill, NJ 07974-0636
- telephone: +1 908 582 2262
- facsimile: +1 908 582 5809
- electronic mail: hgs@research.att.com
-
-
- References
-
-
- [1] D. Cohen, ``A network voice protocol: NVP-II,'' technical report,
- University of Southern California/ISI, Marina del Ray, California,
- Apr. 1981.
-
- [2] N. Borenstein and N. Freed, ``MIME (multipurpose internet mail
- extensions) mechanisms for specifying and describing the format of
- internet message bodies,'' Network Working Group Request for Comments
- RFC 1341, Bellcore, June 1992.
-
- [3] R. Want, A. Hopper, V. Falcao, and J. Gibbons, ``The active badge
- location system,'' ACM Transactions on Information Systems, vol. 10,
- pp. 91--102, Jan. 1992.
-
- [4] R. Want and A. Hopper, ``Active badges and personal interactive
- computing objects,'' Technical Report ORL 92-2, Olivetti Research,
- Cambridge, England, Feb. 1992. also in IEEE Transactions on Consumer
- Electronics, Feb. 1992.
-
- [5] J. G. Gruber and L. Strawczynski, ``Subjective effects of variable
- delay and speech clipping in dynamically managed voice systems,'' IEEE
- Transactions on Communications, vol. COM-33, pp. 801--808, Aug. 1985.
-
- [6] N. S. Jayant, ``Effects of packet losses in waveform coded speech and
- improvements due to an odd-even sample-interpolation procedure,'' IEEE
- Transactions on Communications, vol. COM-29, pp. 101--109, Feb. 1981.
-
- [7] D. Minoli, ``Optimal packet length for packet voice communication,''
- IEEE Transactions on Communications, vol. COM-27, pp. 607--611, Mar.
- 1979.
-
- [8] V. Jacobson, ``Compressing TCP/IP headers for low-speed serial
- links,'' Network Working Group Request for Comments RFC 1144, Lawrence
- Berkeley Laboratory, Feb. 1990.
-
- [9] P. Francis, ``A near-term architecture for deploying Pip,'' IEEE
- Network, vol. 7, pp. 30--37, May 1993.
-
- [10] IMA Digital Audio Focus and Technical Working Groups, ``Recommended
- practices for enhancing digital audio compatibility in multimedia
- systems,'' tech. rep., Interactive Multimedia Association, Annapolis,
- Maryland, Oct. 1992.
-
- [11] W. A. Montgomery, ``Techniques for packet voice synchronization,''
- IEEE Journal on Selected Areas in Communications, vol. SAC-1,
- pp. 1022--1028, Dec. 1983.
-
- [12] D. Cohen, ``A protocol for packet-switching voice communication,''
- Computer Networks, vol. 2, pp. 320--331, September/October 1978.
-
- [13] D. L. Mills, ``Network time protocol (version 3) -- specification,
- implementation and analysis,'' Network Working Group Request for
- Comments RFC 1305, University of Delaware, Mar. 1992.
-
- [14] ISO/IEC JTC 1, ISO/IEC DIS 11172: Information technology --- coding
- of moving pictures and associated audio for digital storage media up
- to about 1.5 Mbit/s. International Organization for Standardization
- and International Electrotechnical Commission, 1992.
-
- [15] L. Delgrossi, C. Halstrick, R. G. Herrtwich, and H. Stüttgen, ``HeiTP:
- a transport protocol for ST-II,'' in Proceedings of the Conference on
- Global Communications (GLOBECOM), (Orlando, Florida), pp. 1369--1373
- (40.02), IEEE, Dec. 1992.
-
- [16] G. J. Holzmann, Design and Validation of Computer Protocols. Englewood
- Cliffs, New Jersey: Prentice Hall, 1991.
-
- [17] A. Nakassis, ``Fletcher's error detection algorithm: how to implement
- it efficiently and how to avoid the most common pitfalls,'' ACM
- Computer Communication Review, vol. 18, pp. 63--88, Oct. 1988.
-
- [18] J. G. Fletcher, ``An arithmetic checksum for serial transmission,''
- IEEE Transactions on Communications, vol. COM-30, pp. 247--252, Jan.
- 1982.
-
- [19] J. Linn, ``Privacy enhancement for Internet electronic mail: Part III
- --- algorithms, modes and identifiers,'' Network Working Group Request
- for Comments RFC 1115, IETF, Aug. 1989.
-
- [20] D. Balenson, ``Privacy enhancement for internet electronic mail: Part
- III: Algorithms, modes, and identifiers,'' Network Working Group
- Request for Comments RFC 1423, IETF, Feb. 1993.
-
- [21] S. Kent, ``Privacy enhancement for internet electronic mail: Part II:
- Certificate-based key management,'' Network Working Group Request for
- Comments RFC 1422, IETF, Feb. 1993.
-
- [22] J. Linn, ``Privacy enhancement for Internet electronic mail: Part
- I --- message encipherment and authentication procedures,'' Network
- Working Group Request for Comments RFC 1113, IETF, Aug. 1989.
-
- [23] R. Rivest, ``The MD5 message-digest algorithm,'' Network Working Group
- Request for Comments RFC 1321, IETF, Apr. 1992.
-
- [24] North American Directory Forum, ``A naming scheme for c=US,'' Network
- Working Group Request for Comments RFC 1255, North American Directory
- Forum, Sept. 1991.
-
- [25] N. S. Jayant and P. Noll, Digital Coding of Waveforms. Englewood
- Cliffs, New Jersey: Prentice Hall, 1984.
-
- [26] P. T. Brady, ``A model for generating on-off speech patterns in
- two-way conversation,'' Bell System Technical Journal, vol. 48,
- pp. 2445--2472, Sept. 1969.
-
- [27] J. Postel, ``Internet protocol,'' Network Working Group Request for
- Comments RFC 791, Information Sciences Institute, Sept. 1981.
-
- [28] R. Cole, ``PVP - a packet video protocol,'' W-Note 28, Information
- Sciences Institute, University of Southern California, Los Angeles,
- California, Aug. 1981.
-
- [29] C. Topolcic, S. Casner, C. Lynn, Jr., P. Park, and K. Schroder,
- ``Experimental internet stream protocol, version 2 (ST-II),'' Network
- Working Group Request for Comments RFC 1190, BBN Systems and
- Technologies, Oct. 1990.
-
- [30] C. Topolcic, ``ST II,'' in First International Workshop on Network and
- Operating System Support for Digital Audio and Video, no. TR-90-062 in
- ICSI Technical Reports, (Berkeley, California), 1990.
-
- [31] J. B. Postel, ``DoD standard transmission control protocol,'' Network
- Working Group Request for Comments RFC 761, Information Sciences
- Institute, Jan. 1980.
-
- [32] J. B. Postel, ``User datagram protocol,'' Network Working Group
- Request for Comments RFC 768, ISI, Aug. 1980.
-
- [33] D. R. Cheriton, ``VMTP: Versatile Message Transaction Protocol
- specification,'' in Network Information Center RFC 1045, (Menlo Park,
- California), pp. 1--123, SRI International, Feb. 1988.
-